Learning Unitary Operators with Help from u(n)

Learning Unitary Operators with Help from u(n)
Stephanie L. Hyland 1,2, Gunnar Rätsch 1
1 Department of Computer Science, ETH Zurich
2 Tri-Institutional Training Program in Computational Biology and Medicine, Weill Cornell Medicine
February 8th, AAAI-17, San Francisco
@_hylandsl

this work in a nutshell
context: vanishing/exploding gradients make training RNNs hard, especially for long sequences; having a unitary transition matrix helps gradients
problem: learning the unitary matrix
solution: a parametrisation using the Lie group-Lie algebra correspondence, and experiments to show it works

context

recurrent neural networks
RNN: a deep neural network with recurrent units, unrolled through time
[figure: unrolled RNN with outputs o_0, o_1, ..., o_t and hidden states h_0, h_1, ..., h_t; modified from https://colah.github.io/posts/2015-08-understanding-lstms/]
the hidden state is computed from the previous hidden state (via the transition matrix) and the input data
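For reference, a minimal sketch of the hidden-state update being described; the symbols W (transition matrix), V (input weights), b (bias) and the pointwise nonlinearity f are my notation, not taken from the slide:

    h_t = f\left( W h_{t-1} + V x_t + b \right)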

long sequences
"And then nothing can protect us against a complete falling away from our Ideas of duty, or can preserve in the soul a grounded reverence for its law, except the clear conviction that even if there never have been actions springing from such pure sources, the question at issue here is not whether this or that has happened; that, on the contrary, reason by itself and independently of all appearances commands what ought to happen; that consequently actions of which the world has perhaps hitherto given no example, actions whose practicability might well be doubted by those who rest everything on experience, are nevertheless commanded unrelentingly by reason; and that, for instance, although up to now there may have existed no loyal friend, pure loyalty in friendship can be no less required from every man, inasmuch as this duty, prior to all experience, is contained as duty in general in the Idea of a reason which determines the will by a priori grounds."
- Kant, Groundwork of the Metaphysics of Morals (407-408; or pp. 75-76 of the H. J. Paton translation [New York: Harper & Row, 1964])
http://www.stephenhicks.org/2009/06/21/philosophys-longest-sentences-part-2/

gradient instability
the recurrence relation again: nested functions, so the chain rule (back-propagation) yields products of derivatives
the gradient of the cost with respect to the hidden state explodes or vanishes deeper into the network
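The slide compresses the standard back-propagation-through-time argument; written out (this expansion is mine, under the recurrence sketched above):

    \frac{\partial C}{\partial h_t} = \frac{\partial C}{\partial h_T} \prod_{k=t+1}^{T} \frac{\partial h_k}{\partial h_{k-1}},
    \qquad
    \Big\| \frac{\partial C}{\partial h_t} \Big\| \le \Big\| \frac{\partial C}{\partial h_T} \Big\| \prod_{k=t+1}^{T} \Big\| \frac{\partial h_k}{\partial h_{k-1}} \Big\|

so when the factor norms sit consistently above (below) 1, the gradient grows (shrinks) exponentially with the depth T - t.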

unitary RNN
standard RNN: h_t = f(W h_{t-1} + V x_t + b)
unitary RNN: h_t = f(U h_{t-1} + V x_t + b), with U unitary
U is unitary: U†U = I
U preserves norms: ||U h|| = ||h||
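One way to see the benefit, continuing the (assumed) notation above: differentiating h_k = f(U h_{k-1} + V x_k + b) gives

    \frac{\partial h_k}{\partial h_{k-1}} = D_k U,
    \qquad
    D_k = \mathrm{diag}\big( f'(U h_{k-1} + V x_k + b) \big),
    \qquad
    \|U\|_2 = 1,

so the transition matrix contributes no exponential growth or decay to the Jacobian product; only the nonlinearity's derivatives can shrink it.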

how to learn U?
for deep learning: stochastic gradient descent; the general update rule U <- U - η ∂C/∂U is not necessarily unitary
to keep U unitary:
1. update and project
2. use a parametrisation
3. ???
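For concreteness, a minimal sketch of option 1 (update and project), assuming the projection is taken to be the nearest unitary matrix in the Frobenius sense, computed via the SVD; this is my illustration, not the procedure used in the paper:

```python
import numpy as np

def project_to_unitary(W):
    """Replace W by the nearest unitary matrix (polar decomposition via SVD)."""
    u, _, vh = np.linalg.svd(W)
    return u @ vh

def sgd_step_with_projection(U, grad, lr=0.01):
    """One 'update and project' step: plain gradient update, then re-unitarise."""
    U = U - lr * grad              # generic SGD update, leaves the unitary group
    return project_to_unitary(U)   # snap back onto U(n)

# tiny usage example on a random complex matrix
n = 4
U = project_to_unitary(np.random.randn(n, n) + 1j * np.random.randn(n, n))
grad = np.random.randn(n, n) + 1j * np.random.randn(n, n)
U = sgd_step_with_projection(U, grad)
print(np.allclose(U.conj().T @ U, np.eye(n)))  # True: still unitary
```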

parametrising U(n)

our approach
observations:
- unitary matrices form a group - a Lie group - U(n)
- Lie groups have associated Lie algebras
- Lie algebras are vector spaces - closed under addition
- for the unitary group, the map from Lie algebra to Lie group is surjective
therefore we can parametrise unitary matrices by the coefficients (relative to a basis) of elements of the Lie algebra, and this can describe any unitary matrix

Lie groups and algebras
Lie algebra (e.g. u(n)): a vector space; the tangent space to the Lie group at its identity element, p = e
Lie group (e.g. U(n)): a group which is also a manifold
[figure modified from Blagoje Oblak, From the Lorentz Group to the Celestial Sphere, arXiv 2015]

the exponential map
exp maps the Lie algebra to the Lie group: L ∈ u(n) ↦ exp(L) = U ∈ U(n)
U(n) is a matrix group, so exp is the matrix exponential
[figure modified from Blagoje Oblak, From the Lorentz Group to the Celestial Sphere, arXiv 2015]

the parametrisation
λ are the coefficients of L relative to a basis {T_j} of the Lie algebra u(n): L = Σ_j λ_j T_j, where each basis matrix T_j is sparse
and we get U from L using the exponential map: U = exp(L) = exp(Σ_j λ_j T_j)
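A minimal numerical sketch of this parametrisation (my own illustration, not the paper's implementation; it assumes the generic basis of u(n) built from sparse skew-Hermitian matrices and uses SciPy's matrix exponential):

```python
import numpy as np
from scipy.linalg import expm

def u_n_basis(n):
    """A basis of the Lie algebra u(n): n^2 sparse skew-Hermitian matrices."""
    basis = []
    for a in range(n):
        for b in range(a, n):
            if a == b:
                T = np.zeros((n, n), dtype=complex)
                T[a, a] = 1j                      # purely imaginary diagonal
                basis.append(T)
            else:
                T = np.zeros((n, n), dtype=complex)
                T[a, b], T[b, a] = 1.0, -1.0      # real antisymmetric pair
                basis.append(T)
                T = np.zeros((n, n), dtype=complex)
                T[a, b], T[b, a] = 1j, 1j         # imaginary symmetric pair
                basis.append(T)
    return basis                                   # len(basis) == n * n

def unitary_from_lambda(lam, basis):
    """U(lambda) = exp( sum_j lambda_j T_j )."""
    L = sum(l * T for l, T in zip(lam, basis))
    return expm(L)

n = 3
basis = u_n_basis(n)
lam = np.random.randn(len(basis))                  # unconstrained real parameters
U = unitary_from_lambda(lam, basis)
print(np.allclose(U.conj().T @ U, np.eye(n)))      # True: U is unitary
```

Because the λ are unconstrained real numbers, any gradient update on λ keeps U exactly unitary by construction.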

experiments and results

learning matrices
simple task: learn the matrix relating pairs of vectors, y_i = U x_i + ε_i, with ε_i Gaussian noise and the pairs (x_i, y_i) given
minimise squared loss between true and predicted vectors
compare multiple approaches and matrix sizes
Arjovsky, Shah & Bengio - Unitary Evolution Recurrent Neural Networks, ICML 2016
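Written out with the parametrisation above, the objective being minimised would take roughly this form (my reconstruction of the formula that appeared as an image on the slide):

    \min_{\lambda \in \mathbb{R}^{n^2}} \; \sum_{i=1}^{N} \Big\| \, y_i - \exp\Big( \textstyle\sum_j \lambda_j T_j \Big) x_i \, \Big\|_2^2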

urnn task 1: adding
task: add two marked entries in a long sequence of numbers
input sequence (example):
03509123186072943327297938759908347645093841572380
00000000000010000000000000000001000000000000000000
desired output: 15 (= 7 + 8)
cost: MSE between predicted & correct answers
length of sequence = length of time to remember
(the real version uses floats between 0 and 1, but that's harder to show)
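A small data generator for the real-valued version of the adding task, as I read it from the slide (sequence length T, values in [0, 1], two marked positions whose values are summed); details such as where the markers may fall are assumptions:

```python
import numpy as np

def adding_task_batch(batch_size, T, rng=None):
    """Adding task: each example is a length-T sequence of (value, marker) pairs;
    the target is the sum of the two values whose marker is 1."""
    if rng is None:
        rng = np.random.default_rng()
    values = rng.uniform(0.0, 1.0, size=(batch_size, T))
    markers = np.zeros((batch_size, T))
    for i in range(batch_size):
        # assumed convention: one marked position in each half of the sequence
        a = rng.integers(0, T // 2)
        b = rng.integers(T // 2, T)
        markers[i, a] = markers[i, b] = 1.0
    inputs = np.stack([values, markers], axis=-1)   # (batch_size, T, 2)
    targets = (values * markers).sum(axis=1)        # (batch_size,)
    return inputs, targets

x, y = adding_task_batch(batch_size=4, T=100)
print(x.shape, y.shape)   # (4, 100, 2) (4,)
```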

urnn task 2: memory
task: remember a sequence of digits over a long time period
input sequence (example):
58329568929000000000000000000000000000000000000000
desired output sequence:
00000000000000000000000000000000000000005832956892
cost: cross-entropy between predicted & true output sequences
number of zeroes: length of time to remember
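And a corresponding sketch of memory-task data, again my own reading of the slide: a short digit string to be reproduced after a long stretch of blanks (the exact alphabet and marker conventions vary between papers, so treat these as assumptions):

```python
import numpy as np

def memory_task_batch(batch_size, n_digits=10, delay=40, rng=None):
    """Memory task: the input is n_digits random digits followed by `delay` blanks (0);
    the target is blank for `delay` steps, then the original digits."""
    if rng is None:
        rng = np.random.default_rng()
    digits = rng.integers(1, 10, size=(batch_size, n_digits))   # digits 1-9; 0 is the blank symbol
    blanks = np.zeros((batch_size, delay), dtype=int)
    inputs = np.concatenate([digits, blanks], axis=1)            # (batch_size, n_digits + delay)
    targets = np.concatenate([blanks, digits], axis=1)           # (batch_size, delay + n_digits)
    return inputs, targets

x, y = memory_task_batch(batch_size=2)
print(x.shape, y.shape)   # (2, 50) (2, 50)
```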

urnn results (tasks 1 & 2)
[figure: learning curves over training updates for the adding task (MSE, T = 100) and the memory task (cross-entropy, T = 100), comparing LSTM, IRNN, urnn, and the gurnn variants]
Arjovsky, Shah & Bengio - Unitary Evolution Recurrent Neural Networks, ICML 2016

conclusion

this work in a nutshell (again)
context: vanishing/exploding gradients make training RNNs hard, especially for long sequences; having a unitary transition matrix helps gradients
problem: learning the unitary matrix
solution: a parametrisation using the Lie group-Lie algebra correspondence, and experiments to show it works

further questions
geometry:
- optimisation directly on the Lie group (manifold)?
- how does the exponential map change the cost surface?
deep learning:
- role of nonlinearity in vanishing/exploding gradients: focusing on the transition operator is not enough
- are complex-valued states necessary/helpful?

thanks!
code: https://github.com/ratschlab/urnn
paper: https://arxiv.org/abs/1607.04903
contact: @_hylandsl, hyland@inf.ethz.ch