Deep Learning For Mathematical Functions

Kesinee Ninsuwan
Institute of Computational and Mathematical Engineering
Stanford University
Stanford, CA 94305
eveve@stanford.edu

Abstract

In this project we are interested in applying deep learning techniques to predict the output of a mathematical function given its inputs. In particular, we apply Recursive Neural Networks (RNN) to learn vector representations of the inputs and to predict the function output. We restrict ourselves to integer-valued functions. We show results for a 1-layer RNN and a 2-layer RNN on one linear function and one nonlinear function. The results show that the 1-layer RNN performs better than the 2-layer RNN in both cases. Moreover, the 1-layer RNN gives higher accuracy on the linear function than on the nonlinear function.

1 Introduction

Deep learning has proven to be successful in Natural Language Processing. Vector representations of words can capture both the semantic and syntactic behavior of text, and they are a powerful tool in applications ranging from simple to very complex tasks, such as spell checking, keyword search, machine translation, and question answering. Our interest in this project lies in applying deep learning techniques to the study of mathematical functions. In analogy to vector representations of words, we attempt to learn vector representations of the inputs of a function in order to predict the true function value given those inputs. In the work of [6], Recursive Neural Networks (RNN) show very good results in discovering mathematical identities. Motivated by this work, we explore the performance of an RNN on our task.

The outline of this report is as follows. Section 2 describes the problem precisely as well as the evaluation. In Section 3 we present the model that will be applied to solve the problem. Section 4 discusses the experiments and results of the model. In Section 5 we discuss some restrictions of the model, and Section 6 gives the conclusion.

2 Problem Statement

Given the inputs u, v of a function f, the task is to predict an accurate output f(u, v). In this project we restrict the functions of interest to integer-valued functions. The training data consists of a set of data points of the form (u, v, f(u, v)), where u, v are the inputs of the function f and f(u, v) is its output. The training set, dev set, and test set are generated by simple Matlab code.

Definition 2.1. For any input (u, v), let f̂(u, v) be the output of the model and f(u, v) be the true value of the function. The model gives a correct output for (u, v) if and only if f̂(u, v) = f(u, v).
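As a concrete illustration of the data described above, the following Python sketch generates triples (u, v, f(u, v)). It is a hypothetical stand-in for the simple Matlab code mentioned; the only assumptions are the helper names and the rejection of outputs outside [0, 999], which anticipates the range restriction stated in Section 3.

```python
# Hypothetical Python equivalent of the simple Matlab data-generation code
# described above: draw random integer inputs u, v and record (u, v, f(u, v)).
import random

def generate_dataset(f, n_points, low=0, high=999, seed=0):
    """Return a list of (u, v, f(u, v)) triples with u, v drawn uniformly from [low, high]."""
    rng = random.Random(seed)
    data = []
    while len(data) < n_points:
        u = rng.randint(low, high)
        v = rng.randint(low, high)
        y = f(u, v)
        # Keep only points whose output also falls in the allowed range (see Section 3).
        if low <= y <= high:
            data.append((u, v, y))
    return data

# Example: the two functions studied in Section 4.
train_add = generate_dataset(lambda u, v: u + v, n_points=60000)
train_sq  = generate_dataset(lambda u, v: u**2 + v, n_points=60000)
```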

3 Technical Approach and Models

In this project we restrict the functions of interest to integer-valued functions that take two integer inputs, such that the inputs u, v and the output f(u, v) lie in [0, 999]. For each function we apply a Recursive Neural Network (RNN) with one or two ReLU activation layers and one softmax layer. The model gives three predicted outputs $\hat{y}^{(1)}, \hat{y}^{(2)}, \hat{y}^{(3)}$. Each $\hat{y}^{(i)} \in \mathbb{R}^{10}$ gives the probability that digit i of the function value is 0, 1, ..., 9. Let $z^{(i)} = \arg\max_k \hat{y}^{(i)}_k$. Then the predicted output is

\[ \hat{f}(u, v) = 100\, z^{(1)} + 10\, z^{(2)} + z^{(3)}. \]

The cost function is the cross-entropy loss

\[ \mathrm{CE}\big(y^{(1)}, y^{(2)}, y^{(3)}, \hat{y}^{(1)}, \hat{y}^{(2)}, \hat{y}^{(3)}\big) = -\sum_{j=1}^{3} \sum_{i=1}^{10} y^{(j)}_i \log \hat{y}^{(j)}_i, \]

where $y^{(j)} \in \mathbb{R}^{10}$ is the one-hot label vector for digit j.

We use two RNN models in this project, one with a single ReLU activation layer and another with two ReLU activation layers. For one ReLU layer we have

\[ h^{(1)} = \max\!\left( W^{(1)} \begin{bmatrix} L_{\mathrm{Left}} \\ L_{\mathrm{Right}} \end{bmatrix} + b^{(1)},\, 0 \right), \qquad \hat{y}^{(j)} = \mathrm{softmax}\!\left( U^{(j)} h^{(1)} + b^{(s,j)} \right), \quad j = 1, 2, 3, \]

where $L_i \in \mathbb{R}^d$ for $i = 0, 1, \ldots, 999$, $W^{(1)} \in \mathbb{R}^{d \times 2d}$, $b^{(1)} \in \mathbb{R}^d$, $U^{(j)} \in \mathbb{R}^{10 \times d}$, and $b^{(s,j)} \in \mathbb{R}^{10}$. Figure 1 shows a diagram of the RNN used in this model.

Figure 1: RNN with one ReLU layer and one softmax layer.

For two ReLU layers we use the same loss function. The network can be written as

\[ h^{(1)} = \max\!\left( W^{(1)} \begin{bmatrix} L_{\mathrm{Left}} \\ L_{\mathrm{Right}} \end{bmatrix} + b^{(1)},\, 0 \right), \qquad h^{(2)} = \max\!\left( W^{(2)} h^{(1)} + b^{(2)},\, 0 \right), \qquad \hat{y}^{(j)} = \mathrm{softmax}\!\left( U^{(j)} h^{(2)} + b^{(s,j)} \right), \]

for j = 1, 2, 3, where $W^{(2)} \in \mathbb{R}^{\mathrm{middle} \times d}$, $b^{(2)} \in \mathbb{R}^{\mathrm{middle}}$, and $U^{(j)} \in \mathbb{R}^{10 \times \mathrm{middle}}$.

Note that one advantage of this model is that we do not assume any linearity of the functions.
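To make the notation above concrete, here is a minimal NumPy sketch of the 1-layer model's forward pass and loss. The parameter shapes follow Section 3; the variable names, random initialization, and the small constant inside the logarithm are illustrative assumptions rather than the author's implementation.

```python
# Minimal NumPy sketch of the 1-layer model's forward pass (Section 3).
# Shapes follow the text: L in R^{1000 x d}, W1 in R^{d x 2d}, U_j in R^{10 x d}.
import numpy as np

d = 45                                          # word-vector dimension (wvecdim)
rng = np.random.default_rng(0)
L  = rng.normal(scale=0.1, size=(1000, d))      # one embedding per integer 0..999
W1 = rng.normal(scale=0.1, size=(d, 2 * d))
b1 = np.zeros(d)
U  = rng.normal(scale=0.1, size=(3, 10, d))     # one softmax head per output digit
bs = np.zeros((3, 10))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def predict(u, v):
    """Return (digit distributions y_hat, decoded integer f_hat) for inputs u, v."""
    x  = np.concatenate([L[u], L[v]])           # [L_Left; L_Right]
    h1 = np.maximum(W1 @ x + b1, 0.0)           # ReLU layer
    y_hat = np.array([softmax(U[j] @ h1 + bs[j]) for j in range(3)])
    z = y_hat.argmax(axis=1)                    # most likely digit per position
    f_hat = 100 * z[0] + 10 * z[1] + z[2]       # hundreds, tens, ones
    return y_hat, f_hat

def cross_entropy(y_hat, target):
    """CE loss summed over the three one-hot digit labels of `target` (small epsilon added for stability)."""
    digits = [target // 100, (target // 10) % 10, target % 10]
    return -sum(np.log(y_hat[j][digits[j]] + 1e-12) for j in range(3))
```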

4 Experiments and Results

We show results for the model on one linear function, f(u, v) = u + v, and one nonlinear function, f(u, v) = u^2 + v. The training set contains 60,000 data points; the dev set and test set contain 20,000 data points each. We train the model by picking data points at random from the training set according to an epoch schedule. To train each model, we first train over different numbers of epochs and choose the epoch count that gives the highest dev-set accuracy. For the 1-layer RNN, we then train with different word-vector dimensions d (wvecdim) and choose the wvecdim that gives the highest dev-set accuracy. For the 2-layer RNN, we instead choose the dimension of the middle activation layer (middledim) that gives the highest dev-set accuracy. After finding these values for each model, we evaluate it on the test set. A sketch of this selection procedure is given after Figure 3.

First we show the results for the linear function f(u, v) = u + v. Figure 2 (left) plots accuracy vs. epoch and shows that dev-set accuracy increases with the number of epochs. Due to limited computational power, we take 100 epochs as the optimal value and use it with different values of the word-vector dimension d. The plot of dev-set accuracy vs. wvecdim in Figure 2 (right) shows that accuracy increases with wvecdim but stays roughly constant beyond wvecdim = 45. Therefore, for the 1-layer RNN with f(u, v) = u + v, we train the model with epoch = 100 and wvecdim = 45.

Figure 2: For the linear function f(u, v) = u + v: (left) accuracy vs. epoch for the 1-layer RNN model; (right) dev-set accuracy vs. word-vector dimension d.

The results for the 2-layer RNN applied to the linear function f(u, v) = u + v are shown in Figure 3. Here we use wvecdim = 30. As with the 1-layer RNN, dev-set accuracy increases with the number of epochs, so we again take the optimal epoch count to be 100. In the plot of accuracy vs. middledim in Figure 3 (right), accuracy is highest at middledim = 45.

Figure 3: For the linear function f(u, v) = u + v: (left) accuracy vs. epoch for the 2-layer RNN model; (right) dev-set accuracy vs. the dimension of the middle activation layer.
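The two-stage selection procedure described at the start of this section can be summarized by the following sketch. Here `train_model` is a hypothetical training routine (not the author's code) of the form train_model(train, epochs, dim) returning a model with a predict(u, v) method, and `accuracy` implements the exact-match criterion of Definition 2.1.

```python
# Sketch of the two-stage hyperparameter selection described above.
# `train_model` is a hypothetical callable supplied by the caller.

def accuracy(model, data):
    """Fraction of data points for which the decoded prediction equals the true value (Definition 2.1)."""
    return sum(model.predict(u, v) == y for u, v, y in data) / len(data)

def select_hyperparameters(train_model, train, dev, epoch_grid, dim_grid, provisional_dim=30):
    # Stage 1: with a provisional dimension, pick the epoch count that
    # maximizes dev-set accuracy.
    best_epochs = max(
        epoch_grid,
        key=lambda e: accuracy(train_model(train, epochs=e, dim=provisional_dim), dev))
    # Stage 2: with that epoch count, pick the dimension
    # (wvecdim for the 1-layer model, middledim for the 2-layer model).
    best_dim = max(
        dim_grid,
        key=lambda d: accuracy(train_model(train, epochs=best_epochs, dim=d), dev))
    return best_epochs, best_dim
```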

Next we show the results for the nonlinear function f(u, v) = u^2 + v. The plots for the 1-layer RNN are shown in Figure 4, and those for the 2-layer RNN in Figure 5. As with the linear function, dev-set accuracy increases with the number of epochs for both models. For the 1-layer RNN, wvecdim = 45 gives the highest accuracy; for the 2-layer RNN, the optimal middledim is 35.

Figure 4: For the nonlinear function f(u, v) = u^2 + v: (left) accuracy vs. epoch for the 1-layer RNN model; (right) dev-set accuracy vs. word-vector dimension.

Figure 5: For the nonlinear function f(u, v) = u^2 + v: (left) accuracy vs. epoch for the 2-layer RNN model; (right) dev-set accuracy vs. the dimension of the middle activation layer.

Table 1 shows the accuracy of each model on the test set. For both functions, the 1-layer RNN performs better than the 2-layer RNN, and the 1-layer RNN gives higher accuracy on the linear function.

Table 1: Test-set accuracy of each model.

                        1-layer RNN    2-layer RNN
  f(u, v) = u + v          0.93           0.70
  f(u, v) = u^2 + v        0.81           0.72

5 Challenges and Restrictions

In Section 3 we describe the model for functions that take two integers as input, and in this project we applied it only to such functions. Extending the model to functions that take more than two inputs is straightforward: it is precisely the framework of Recursive Neural Networks [3], in which the network has more than one layer of composition nodes. One possible form of this pairwise composition is sketched below.
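The following sketch illustrates one way such a pairwise composition could look, in the spirit of recursive neural networks [3]. It reuses the shapes of the 1-layer parameters from Section 3 (W1 in R^{d x 2d}, b1 in R^d); it is an illustrative assumption, not the author's implementation.

```python
# One possible realization of the pairwise composition mentioned above:
# fold an arbitrary number of input embeddings into one root vector using
# the same ReLU composition layer as in the 1-layer model.
import numpy as np

def compose(left, right, W1, b1):
    """ReLU composition of two d-dimensional vectors into a single d-dimensional vector."""
    return np.maximum(W1 @ np.concatenate([left, right]) + b1, 0.0)

def encode_inputs(input_vectors, W1, b1):
    """Fold a list of input embeddings left to right into one root representation."""
    h = input_vectors[0]
    for vec in input_vectors[1:]:
        h = compose(h, vec, W1, b1)
    return h  # the root vector is then fed to the three softmax heads as in Section 3
```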

The model presented in this project still has several restrictions. The first is that we require integer-valued functions with inputs and output in [0, 999]. One extension of this work is to modify the model to handle real-valued functions; for the model presented in this paper, extending the range of the output increases the dimensionality of the parameters. With limited computational power, we could not train the model for more epochs. The plots of accuracy vs. epoch in Section 4 show that dev-set accuracy increases with the number of epochs, so one way to improve accuracy is to train over more epoch values to find the optimum. Although the model presented here still has strong restrictions, the results are a first step toward applying deep learning to study more complex functions with larger domains and images.

6 Conclusion

In this paper we give a model based on Recursive Neural Networks for predicting the value of a function given its inputs. One advantage of this model is that we do not assume linearity or any other properties of the functions. The results show better performance of the 1-layer RNN over the 2-layer RNN on both the linear and the nonlinear function. Future work includes modifying the model to use more activation layers, adding regularization techniques, and changing the nonlinearities in the activation layers. To establish stronger conclusions, we must test the model on more classes of functions. It would also be interesting to study the behavior of the model when it is trained on training sets with different distributions.

Acknowledgments

The author would like to thank Mike Phulsuksombati for discussions about the project idea, for all the data used in this project, and for insightful advice throughout the process. In addition, the author would like to thank the instructor Dr. Richard Socher and all the staff of CS224D at Stanford University for their wonderful advice and guidance.

References

[1] Bengio, Y. (2012). Practical Recommendations for Gradient-Based Training of Deep Architectures. arXiv:1206.5533.
[2] Socher, R., Bauer, J., Manning, C.D., and Ng, A.Y. (2013). Parsing with Compositional Vector Grammars. In Proceedings of ACL.
[3] Socher, R., Lin, C.C., Ng, A.Y., and Manning, C.D. (2011). Parsing Natural Scenes and Natural Language with Recursive Neural Networks. In Proceedings of ICML.
[4] Socher, R., et al. (2013). Recursive Deep Models for Semantic Compositionality over a Sentiment Treebank. In Proceedings of EMNLP.
[5] Pennington, J., Socher, R., and Manning, C.D. (2014). GloVe: Global Vectors for Word Representation. In Proceedings of EMNLP.
[6] Zaremba, W., Kurach, K., and Fergus, R. (2014). Learning to Discover Efficient Mathematical Identities. arXiv:1406.1584.
[7] Zaremba, W., and Sutskever, I. (2015). Learning to Execute. arXiv:1410.4615.