Natural Language Understanding. Lecture 12: Recurrent Neural Networks and LSTMs
Natural Language Understanding, Lecture 12: Recurrent Neural Networks and LSTMs. Adam Lopez (credits: Mirella Lapata and Frank Keller), 26 January 2018, School of Informatics, University of Edinburgh.

Outline: Recap: probability, language models, and feedforward networks; Simple Recurrent Networks; Backpropagation Through Time; Long short-term memory.

Reading: Mikolov et al. (2010), Olah (2015).

Recap: probability, language models, and feedforward networks

Most models in NLP are probabilistic models, e.g. a language model decomposed with the chain rule of probability:

$P(w_1 \dots w_k) = \prod_{i=1}^{k} P(w_i \mid w_1, \dots, w_{i-1})$

Modeling decision: the Markov assumption:

$P(w_i \mid w_1, \dots, w_{i-1}) \approx P(w_i \mid w_{i-n+1}, \dots, w_{i-1})$

Rules of probability (remember: vocabulary V is finite): $P : V \to \mathbb{R}$ with

$\sum_{w \in V} P(w \mid w_{i-n+1}, \dots, w_{i-1}) = 1$
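To make the chain-rule decomposition and the Markov assumption concrete, here is a minimal sketch (not from the lecture) of a bigram (n = 2) model in Python; the toy counts and the helper names bigram_prob and sentence_prob are illustrative inventions:

```python
# A minimal sketch of the chain-rule decomposition under a bigram
# (n = 2) Markov assumption. The toy counts are made up; a real
# model would estimate them from a corpus.
counts = {
    ("<s>", "the"): 4, ("the", "roses"): 2,
    ("roses", "are"): 2, ("are", "red"): 2,
}
context_totals = {"<s>": 4, "the": 4, "roses": 2, "are": 2}

def bigram_prob(w, prev):
    """P(w | prev), estimated by relative frequency."""
    return counts.get((prev, w), 0) / context_totals[prev]

def sentence_prob(words):
    """P(w_1 ... w_k) = prod_i P(w_i | w_{i-1}), by the chain rule
    plus the Markov assumption."""
    p, prev = 1.0, "<s>"
    for w in words:
        p *= bigram_prob(w, prev)
        prev = w
    return p

print(sentence_prob(["the", "roses", "are", "red"]))  # 0.5
```

Note that each factor conditions only on the previous word; increasing n enlarges the context the model can see, which is exactly the limitation the rest of the lecture addresses.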
MLPs (aka deep NNs) are functions from a vector to a vector

Probability distributions are vectors! Example: "Summer is hot, winter is ___", with output distribution: cold 0.6, grey 0.3, winter 0.1, is 0, hot 0, summer 0.

What functions can we use?
Matrix multiplication: converts an m-element vector to an n-element vector. Parameters are usually of this form.
Sigmoid, exp, tanh, ReLU, etc.: elementwise nonlinear transforms from an m-element vector to an m-element vector.
Concatenation: combines an m-element and an n-element vector into an (m+n)-element vector.
Multiple functions can also share input and substructure.

Softmax will convert any vector to a probability distribution.

Elements of discrete vocabularies are vectors! Use one-hot encoding to represent any element of a finite set.

Feedforward LM: a function from n vectors (the one-hot context words) to a vector (a distribution over the next word). A short sketch of one-hot encoding and softmax follows below.
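Here is a minimal numpy sketch of the two encodings on this slide: one-hot vectors for a discrete vocabulary, and softmax to turn an arbitrary vector into a probability distribution. The tiny vocabulary and the random weight matrix W are illustrative assumptions, not lecture code:

```python
import numpy as np

vocab = ["cold", "grey", "winter", "is", "hot", "summer"]

def one_hot(word):
    # a vector of zeros with a single 1 at the word's index
    v = np.zeros(len(vocab))
    v[vocab.index(word)] = 1.0
    return v

def softmax(z):
    e = np.exp(z - z.max())   # subtract max for numerical stability
    return e / e.sum()

rng = np.random.default_rng(0)
W = rng.normal(size=(len(vocab), len(vocab)))  # untrained weights

x = one_hot("is")             # input vector for the current word
y = softmax(W @ x)            # nonnegative entries that sum to 1
print(y.sum())                # 1.0 (up to floating point)
```

Whatever the matrix multiplication produces, softmax maps it onto the probability simplex, which is why it appears as the output layer of the language models below.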
How much context do we need?

The roses are red.
The roses in the vase are red.
The roses in the vase by the door are red.
The roses in the vase by the door to the kitchen are red.

Captain Ahab nursed his grudge for many years before seeking the White ___
Donald Trump nursed his grudge for many years before seeking the White ___

Simple Recurrent Networks

Modeling Context

Context is important in language modeling: n-gram language models use a limited context (fixed n); feedforward networks can be used for language modeling, but their input is also of fixed size; yet linguistic dependencies can be arbitrarily long. This is where recurrent neural networks come in: the input of an RNN includes a copy of the previous hidden layer of the network; effectively, the RNN buffers all the inputs it has seen before; it can thus model context dependencies of arbitrary length. We will look at simple recurrent networks first.

Architecture

The simple recurrent network only looks back one time step. [Figure: input x(t) and previous state s(t-1) feed the new state s(t) through weight matrices V and U; s(t) produces the output y(t) through W.]
Architecture

We have input layer x, hidden layer s (the state), and output layer y. The input at time t is x(t), the output is y(t), and the hidden layer is s(t).

$s_j(t) = f(\text{net}_j(t))$   (1)
$\text{net}_j(t) = \sum_i x_i(t)\, v_{ji} + \sum_l s_l(t-1)\, u_{jl}$   (2)
$y_k(t) = g(\text{net}_k(t))$   (3)
$\text{net}_k(t) = \sum_j s_j(t)\, w_{kj}$   (4)

where f(z) is the sigmoid and g(z) the softmax function:

$f(z) = \frac{1}{1 + e^{-z}}$   $g(z_m) = \frac{e^{z_m}}{\sum_k e^{z_k}}$

Input and Output

For initialization, set s and x to small random values; for each time step, copy s(t-1) and use it to compute s(t); vector x(t) uses 1-of-N (one-hot) encoding over the words in the vocabulary; output vector y(t) is a probability distribution over the next word given the current word w(t) and context s(t-1); the size of the hidden layer is usually 30-500 units, depending on the size of the training data. (A numpy sketch of one forward step follows below.)

Training

We can use standard backprop with stochastic gradient descent: simply treat the network as a feedforward network with s(t-1) as additional input; backpropagate the error to adjust the weight matrices U and V; present all of the training data in each epoch; test on validation data to see if the log-likelihood improves; adjust the learning rate if necessary.

Error signal for training: $\text{error}(t) = \text{desired}(t) - y(t)$, where desired(t) is the one-hot encoding of the correct next word.

Backpropagation Through Time
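The following is a direct numpy transcription of equations (1)-(4), one forward step of the simple recurrent network. The sizes, random weights, and the helper name srn_step are illustrative assumptions; a real model learns V, U, W by backpropagation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def srn_step(x_t, s_prev, V, U, W):
    """Compute s(t) and y(t) from x(t) and s(t-1)."""
    s_t = sigmoid(V @ x_t + U @ s_prev)   # equations (1)-(2)
    y_t = softmax(W @ s_t)                # equations (3)-(4)
    return s_t, y_t

vocab_size, hidden_size = 10, 5
rng = np.random.default_rng(1)
V = rng.normal(scale=0.1, size=(hidden_size, vocab_size))
U = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
W = rng.normal(scale=0.1, size=(vocab_size, hidden_size))

s = rng.uniform(-0.1, 0.1, size=hidden_size)  # small random initial state
x = np.eye(vocab_size)[3]                     # one-hot current word
s, y = srn_step(x, s, V, U, W)
print(y.sum())  # a distribution over the next word: sums to 1
```

Running srn_step in a loop over a sentence, feeding each returned s back in, is what lets the state carry context of arbitrary length.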
From Simple to Full RNNs

Let's drop the assumption that only the hidden layer from the previous time step is used; instead use all previous time steps. We can think of this as unfolding over time: the RNN is unfolded into a sequence of feedforward networks; we need a new learning algorithm: backpropagation through time (BPTT).

Architecture

The full RNN looks at all the previous time steps. [Figure: the unfolded network, with inputs x(t-2), x(t-1), x(t) feeding states s(t-2), s(t-1), s(t) through V, each state feeding the next through U back to s(t-3), and s(t) producing y(t) through W.]

Standard Backpropagation

For output units, we update the weights W using:

$\Delta w_{kj} = \eta \sum_p \delta_{pk}\, s_{pj}$ with $\delta_{pk} = (d_{pk} - y_{pk})\, g'(\text{net}_{pk})$

where d_{pk} is the desired output of unit k for training pattern p. For hidden units, we update the weights V using:

$\Delta v_{ji} = \eta \sum_p \delta_{pj}\, x_{pi}$ with $\delta_{pj} = \Big[\sum_k \delta_{pk}\, w_{kj}\Big] f'(\text{net}_{pj})$

This is just standard backprop, with notation adjusted for RNNs!

Going Back in Time

If we only go back one time step, then we can update the weights U using the standard delta rule:

$\Delta u_{jl} = \eta \sum_p \delta_{pj}(t)\, s_{pl}(t-1)$ with $\delta_{pj}(t) = \Big[\sum_k \delta_{pk}\, w_{kj}\Big] f'(\text{net}_{pj})$

However, if we go further back in time, then we need to apply the delta rule to the previous time step as well:

$\delta_{pj}(t-1) = \sum_h \delta_{ph}(t)\, u_{hj}\, f'(s_{pj}(t-1))$

where h is the index for the hidden unit at time step t, and j for the hidden unit at time step t-1. (A numpy sketch of this recursion follows below.)
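Here is a sketch of the delta recursion above for a sigmoid hidden layer, using the identity f'(net) = s(1 - s). The function name bptt_grad_U and the random test values are illustrative assumptions, not lecture code:

```python
import numpy as np

def bptt_grad_U(delta_out, W, U, states, tau):
    """delta_out: output-layer delta at time t.
    states: [s(t), s(t-1), ..., s(t-tau)] hidden activations.
    Returns the gradient for U summed over tau time steps."""
    # hidden delta at time t, backpropagated from the output layer
    delta = (W.T @ delta_out) * states[0] * (1.0 - states[0])
    dU = np.zeros_like(U)
    for k in range(1, tau + 1):
        dU += np.outer(delta, states[k])  # delta(t-k+1) s(t-k)^T
        # one step further back: delta(t-k) = [U^T delta] * f'(s(t-k))
        delta = (U.T @ delta) * states[k] * (1.0 - states[k])
    return dU

rng = np.random.default_rng(2)
H, K, tau = 5, 10, 3
W = rng.normal(size=(K, H))
U = rng.normal(size=(H, H))
states = [rng.uniform(0.1, 0.9, size=H) for _ in range(tau + 1)]
delta_out = rng.normal(size=K)    # e.g. desired(t) - y(t)
print(bptt_grad_U(delta_out, W, U, states, tau).shape)  # (5, 5)
```

Each pass through the loop applies the slide's recursion once, so the contribution from step t-k has been multiplied by U and by f' exactly k times.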
Going Back in Time

We can do this for an arbitrary number of time steps τ, adding up the resulting deltas to compute $\Delta u_{jl}$. The RNN effectively becomes a deep network of depth τ. For language modeling, Mikolov et al. show that increased τ improves performance.

As we backpropagate through time, gradients tend toward 0

We adjust U using backprop through time. For time step t:

$\Delta u_{jl} = \eta \sum_p \delta_{pj}(t)\, s_{pl}(t-1)$ with $\delta_{pj}(t) = \Big[\sum_k \delta_{pk}\, w_{kj}\Big] f'(\text{net}_{pj})$

For time step t-1:

$\delta_{pj}(t-1) = \sum_h \delta_{ph}(t)\, u_{hj}\, f'(s_{pj}(t-1))$

For time step t-2:

$\delta_{pj}(t-2) = \sum_h \delta_{ph}(t-1)\, u_{hj}\, f'(s_{pj}(t-2)) = \sum_h \sum_{h_1} \delta_{ph_1}(t)\, u_{h_1 h}\, f'(s_{ph}(t-1))\, u_{hj}\, f'(s_{pj}(t-2))$

At every time step, we multiply the weights with another gradient. The gradients are < 1, so the deltas become smaller and smaller. So in fact, the RNN is not able to learn long-range dependencies well, as the gradient vanishes: it rapidly forgets previous inputs. (The short numerical illustration below makes this concrete.)

[Source: https://theclevermachine.wordpress.com/]
[Source: Graves, Supervised Sequence Labelling with RNNs, 2012.]
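A short numerical illustration of the vanishing gradient (my own, not from the lecture): each step back in time multiplies the delta by U^T and by the sigmoid derivative s(1 - s), which is at most 0.25, so its norm shrinks roughly geometrically. The weight scale here is an arbitrary choice:

```python
import numpy as np

rng = np.random.default_rng(3)
H = 5
U = rng.normal(scale=0.5, size=(H, H))
s = rng.uniform(0.1, 0.9, size=H)   # fixed hidden activations
delta = rng.normal(size=H)

for t in range(1, 21):
    delta = (U.T @ delta) * s * (1.0 - s)   # one step back in time
    if t % 5 == 0:
        print(t, np.linalg.norm(delta))     # norm heads toward 0
```

After a couple of dozen steps the delta is numerically negligible, which is why distant inputs contribute almost nothing to the weight updates.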
A better RNN: Long Short-term Memory

Solution: the network can sometimes pass on information from previous time steps unchanged, so that it can learn from distant inputs: long short-term memory.

Architecture of the LSTM

To achieve this, we need to make the units of the network more complicated: LSTMs have a hidden layer of memory blocks; each block contains a memory cell and three multiplicative units: the input, output, and forget gates; the gates are trainable: each block can learn whether to keep information across time steps or not. In contrast, the RNN uses simple hidden units, which just sum the input and pass it through an activation function.

The Gates and the Memory Cell

Each memory block consists of four units:
Input gate: controls whether the input to the block is passed on to the memory cell or ignored;
Output gate: controls whether the current activation vector of the memory cell is passed on to the output layer or not;
Forget gate: controls whether the activation vector of the memory cell is reset to zero or maintained;
Memory cell: stores the current activation vector, with a connection to itself controlled by the forget gate.

There are also peephole connections; we won't discuss these. (A one-step sketch without peepholes follows below.)

[Figure legend: O = open gate, -- = closed gate; black = high activation, white = low activation. Source: Graves, Supervised Sequence Labelling with RNNs, 2012.]
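The following is a minimal one-step LSTM memory-block sketch, with peephole connections omitted as in the lecture; bias terms are also left out to keep it short, and the weight values and the name lstm_step are illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, Wi, Wo, Wf, Wc):
    """x_t: input; h_prev: previous block output; c_prev: memory cell."""
    z = np.concatenate([x_t, h_prev])  # all four units see the same input
    i = sigmoid(Wi @ z)                # input gate
    o = sigmoid(Wo @ z)                # output gate
    f = sigmoid(Wf @ z)                # forget gate
    g = np.tanh(Wc @ z)                # block input activation
    c = f * c_prev + i * g             # linear memory cell: no squashing
    h = o * np.tanh(c)                 # only what the output gate lets through
    return h, c

n_in, n_hid = 4, 3
rng = np.random.default_rng(4)
Wi, Wo, Wf, Wc = [rng.normal(scale=0.1, size=(n_hid, n_in + n_hid))
                  for _ in range(4)]
h, c = np.zeros(n_hid), np.zeros(n_hid)
h, c = lstm_step(rng.normal(size=n_in), h, c, Wi, Wo, Wf, Wc)
print(h.shape, c.shape)  # (3,) (3,)
```

If f stays near 1 and i near 0, the cell value c is copied forward unchanged, which is exactly the mechanism the vanishing-gradient discussion below relies on.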
A Single LSTM Memory Block / RNN Unit compared to LSTM Memory Block

[Figures: a simple SRN unit next to a full LSTM memory block, showing the block input, memory cell with peepholes, input/forget/output gates, and block output. Legend: unweighted connection; weighted connection; connection with time-lag; branching point; multiplication; sum over all inputs; g = gate activation function (always sigmoid); input activation function (usually tanh); output activation function (usually tanh). Sources: Klaus Greff et al.: LSTM: A Search Space Odyssey, 2015; Graves, Supervised Sequence Labelling with RNNs, 2012.]

The Gates and the Memory Cell

Gates are regular hidden units: they sum their input and pass it through a sigmoid activation function; all four inputs to the block are the same: the input layer and the state layer (the hidden layer at the previous time step); all gates have multiplicative connections: if the activation is close to zero, then the gate doesn't let anything through; the memory cell itself is linear: it has no activation function; but the block as a whole has input and output activation functions (can be tanh or sigmoid); all connections within the block are unweighted: they just pass on information (i.e., copy the incoming vector); the only output that the rest of the network sees is what the output gate lets through.

Putting LSTM Memory Blocks Together

Network with four input units, a hidden layer of two memory blocks, and five output units. [Source: Graves, Supervised Sequence Labelling with RNNs, 2012.]
Vanishing Gradients Again

Why does this solve the vanishing gradient problem? The memory cell is linear, so its gradient doesn't vanish; an LSTM block can retain information indefinitely: if the forget gate is open (close to 1) and the input gate is closed (close to 0), then the activation of the cell persists; in addition, the block can decide when to output information by opening the output gate; the block can therefore retain information over an arbitrary number of time steps before it outputs it; the block learns when to accept input, produce output, and forget information: the gates have trainable weights.

Applications

LSTMs are useful for lots of sequence labeling tasks: part-of-speech tagging and parsing; semantic role labeling; opinion mining. With modification, they are also widely used for sequence-to-sequence problems: machine translation; question answering; summarization; sentence compression and simplification. We will see some of these applications in the rest of the course.

Summary

Recurrent networks encode a complete sequence. RNNs can be trained with standard backprop. We can also unfold an RNN over time and train it with backpropagation through time; this turns the RNN into a deep network and gives even better language modeling performance. Backprop through time with RNNs has the problem that gradients vanish with increasing time steps. The LSTM is a way of addressing this problem: it replaces additive hidden units with complex memory blocks.