Recurrent Neural Networks. Dr. Kira Radinsky CTO SalesPredict Visiting Professor/Scientist Technion. Slides were adapted from lectures by Richard Socher


1 Recurrent Neural Networks Dr. Kira Radinsky CTO SalesPredict Visiting Professor/Scientist Technion Slides were adapted from lectures by Richard Socher

2 Overview Traditional language models, RNNs, RNN language models, important training problems and tricks, vanishing and exploding gradient problems, bidirectional RNNs, RNNs for other sequence tasks

3 Language Models A language model computes a probability for a sequence of words: P(w_1, ..., w_T). Useful for machine translation. Word ordering: p(the cat is small) > p(small the is cat). Word choice: p(walking home after school) > p(walking house after school)

4 Traditional Language Models Probability is usually conditioned on a window of n previous words. An incorrect but necessary Markov assumption! To estimate probabilities, compute counts for unigrams and bigrams (conditioning on one/two previous word(s)): p(w_2 | w_1) = count(w_1, w_2) / count(w_1) and p(w_3 | w_1, w_2) = count(w_1, w_2, w_3) / count(w_1, w_2)
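For concreteness, a minimal count-based bigram estimator in Python (the toy corpus and function name are my own, not from the slides):

```python
from collections import Counter

def bigram_probability(tokens, w1, w2):
    """Estimate p(w2 | w1) = count(w1, w2) / count(w1) from raw counts."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    return bigrams[(w1, w2)] / unigrams[w1]

tokens = "the cat is small the cat is cute".split()
print(bigram_probability(tokens, "the", "cat"))  # 1.0: in this toy corpus "the" is always followed by "cat"
```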

5 Traditional Language Models Performance improves by keeping counts of higher-order n-grams and doing smoothing and so-called backoff (e.g. if the 4-gram is not found, try the 3-gram, etc.). There are A LOT of n-grams! → Gigantic RAM requirements! Recent state of the art: "Scalable Modified Kneser-Ney Language Model Estimation" by Heafield et al.: "Using one machine with 140 GB RAM for 2.8 days, we built an unpruned model on 126 billion tokens"

6 Original neural language model "A Neural Probabilistic Language Model", Bengio et al. 2003. The i-th output is P(w_t = i | context). Original equations: y = b + W x + U tanh(d + H x), followed by a softmax (where most of the computation happens): P̂(w_t | w_{t-1}, ..., w_{t-n+1}) = e^{y_{w_t}} / Σ_i e^{y_i}. [Figure: each context word index w_{t-n+1}, ..., w_{t-2}, w_{t-1} is a table look-up into a shared matrix C, giving C(w_{t-n+1}), ..., C(w_{t-1}); parameters are shared across words.] Problem: fixed window of context for conditioning :(
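A sketch of the fixed-window forward pass implied by these equations, assuming a vocabulary of size V, embedding dimension m, hidden size h, and a window of n−1 = 3 previous words (all sizes are illustrative, not from the slide):

```python
import numpy as np

V, m, n_minus_1, h = 8000, 50, 3, 100
rng = np.random.default_rng(0)

C = rng.normal(scale=0.1, size=(V, m))               # shared embedding table C
H = rng.normal(scale=0.1, size=(h, n_minus_1 * m))   # input-to-hidden weights
d = np.zeros(h)                                      # hidden bias
U = rng.normal(scale=0.1, size=(V, h))               # hidden-to-output weights
W = rng.normal(scale=0.1, size=(V, n_minus_1 * m))   # direct input-to-output weights
b = np.zeros(V)                                      # output bias

def forward(context_indices):
    """y = b + W x + U tanh(d + H x), followed by a softmax over the vocabulary."""
    x = C[context_indices].reshape(-1)               # concatenate the looked-up embeddings
    y = b + W @ x + U @ np.tanh(d + H @ x)
    e = np.exp(y - y.max())                          # numerically stable softmax
    return e / e.sum()

p_next = forward([17, 42, 7])                        # P(w_t = i | previous 3 words), shape (V,)
```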

7 Recurrent Neural Networks! RNNs tie the weights at each time step and condition the neural network on all previous words; the RAM requirement only scales with the number of words. [Figure: a Recurrent Neural Network (RNN) unrolled over three time steps, with inputs x_{t-1}, x_t, x_{t+1}, hidden states h_{t-1}, h_t, h_{t+1} connected by the shared matrix W, and outputs y_{t-1}, y_t, y_{t+1}.]

8 Recurrent Neural Network language model Given a list of word vectors x_1, ..., x_T, at a single time step: h_t = σ(W^(hh) h_{t-1} + W^(hx) x_t) and ŷ_t = softmax(W^(S) h_t), where ŷ_t is the prediction output at time step t, h_{t-1} is the output of the previous step, and x_t is the next word vector in the document.

9 Recurrent Neural Network language model Main idea: we use the same set of W weights at all time steps! Everything else is the same: h_{t-1} is the output of the non-linear function at the previous time step t-1, i.e. the document context score so far; x_t is the column vector of L at index [t] at time step t; W^(hh) is the weights matrix used to condition the output of the previous time step, h_{t-1}; W^(hx) is the weights matrix used to condition the input word vector, x_t; V is the vocabulary.
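A minimal sketch of one time step of this language model in numpy (σ taken to be the logistic sigmoid; the sizes are the ones used in the example two slides below):

```python
import numpy as np

def rnn_lm_step(x_t, h_prev, W_hh, W_hx, W_S):
    """h_t = sigmoid(W_hh h_{t-1} + W_hx x_t); y_hat_t = softmax(W_S h_t)."""
    h_t = 1.0 / (1.0 + np.exp(-(W_hh @ h_prev + W_hx @ x_t)))
    scores = W_S @ h_t
    e = np.exp(scores - scores.max())                # numerically stable softmax
    return h_t, e / e.sum()

D_h, d, V = 100, 50, 8000                            # hidden size, word-vector size, vocabulary size
rng = np.random.default_rng(0)
W_hh = rng.normal(scale=0.01, size=(D_h, D_h))
W_hx = rng.normal(scale=0.01, size=(D_h, d))
W_S = rng.normal(scale=0.01, size=(V, D_h))
h_t, y_hat_t = rnn_lm_step(rng.normal(size=d), np.zeros(D_h), W_hh, W_hx, W_S)
```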

10 Recurrent Neural Network language model ŷ_t is a probability distribution over the vocabulary, the distribution at each time step t. Essentially, ŷ_t is the prediction of the next word given the document context score so far (i.e. h_{t-1}) and the last observed word vector x_t.

11 RNN Example Vocabulary size C = 8000, hidden layer size H = 100. U, V and W are the parameters of our network that we want to learn from data. Thus, we need to learn a total of 2HC + H^2 = 1,610,000 parameters.
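Checking the arithmetic (assuming U is the H×C input matrix, V the C×H output matrix, and W the H×H recurrent matrix):

```python
C, H = 8000, 100                 # vocabulary size and hidden layer size
n_params = 2 * H * C + H * H     # U: H*C, V: C*H, W: H*H
print(n_params)                  # 1610000
```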

12 Recurrent Neural Network language model Same cross-entropy loss function, but predicting words instead of classes: J^(t)(θ) = − Σ_{j=1}^{|V|} y_{t,j} log ŷ_{t,j}

13 Recurrent Neural Network language model Evaluation could just be the negative of the average log probability over a dataset of size (number of words) T: J = −(1/T) Σ_{t=1}^{T} Σ_{j=1}^{|V|} y_{t,j} log ŷ_{t,j}. But more common is perplexity: 2^J. Lower is better!
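A small sketch of both quantities (using base-2 logs so that perplexity is exactly 2^J; with natural logs it would be e^J):

```python
import numpy as np

def cross_entropy(predictions, targets):
    """predictions: (T, V) array of predicted distributions y_hat_t; targets: T correct word indices."""
    T = len(targets)
    return -np.mean(np.log2(predictions[np.arange(T), targets]))

def perplexity(predictions, targets):
    """Perplexity = 2^J; lower is better."""
    return 2.0 ** cross_entropy(predictions, targets)
```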

14 Training RNNs is hard We multiply by the same matrix at each time step during forward prop. [Figure: the unrolled RNN with the shared matrix W, as before.] Ideally, inputs from many time steps ago could modify the output y. Take for example an RNN with 2 time steps! Insightful!

15 The vanishing/exploding gradient problem The RNN goal is to enable propagating context information through faraway time steps. Recurrent neural networks propagate the same W matrix from one time step to the next. [Figure: the unrolled RNN with the shared matrix W.]

16 The vanishing/exploding gradient problem Will the RNN succeed in the same way in the following sentences? For long sentences, the probability that "John" is recognized as the next word decreases with the size of the intervening context.

17 The vanishing gradient problem - Details Similar but simpler RNN formulation: h_t = W f(h_{t-1}) + W^(hx) x_t, ŷ_t = W^(S) f(h_t). The total error is the sum of the errors at each time step t: ∂E/∂W = Σ_{t=1}^{T} ∂E_t/∂W. Hardcore chain rule application: ∂E_t/∂W = Σ_{k=1}^{t} (∂E_t/∂y_t)(∂y_t/∂h_t)(∂h_t/∂h_k)(∂h_k/∂W)

18 The vanishing gradient problem - Details Similar to backprop, but a less efficient formulation. Useful for the analysis we'll look at: ∂h_t/∂h_k. More chain rule: ∂h_t/∂h_k = Π_{j=k+1}^{t} ∂h_j/∂h_{j-1}. Each partial is a Jacobian matrix.

19 Jacobian - Reminder The Jacobian of a vector-valued function f: R^n → R^m is the m × n matrix of all first-order partial derivatives, with entry (i, j) equal to ∂f_i/∂x_j.

20 The vanishing gradient problem - Details From the previous slide: ∂h_t/∂h_k = Π_{j=k+1}^{t} ∂h_j/∂h_{j-1}. Remember: h_j = W f(h_{j-1}) + W^(hx) x_j. To compute the Jacobian, differentiate each element of the matrix: ∂h_j/∂h_{j-1} = W^T diag(f'(h_{j-1})). Check at home that you understand the diag matrix formulation.

21 The vanishing gradient problem - Details Analyzing the norms of the Jacobians yields ‖∂h_j/∂h_{j-1}‖ ≤ ‖W^T‖ ‖diag(f'(h_{j-1}))‖ ≤ β_W β_h, where we define the β's as upper bounds of the norms, so ‖∂h_t/∂h_k‖ ≤ (β_W β_h)^{t-k}. The gradient is a product of Jacobian matrices, each associated with a step in the forward computation. This can become very small or very large quickly [Bengio et al 1994], and the locality assumption of gradient descent breaks down. → Vanishing or exploding gradient
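A minimal numerical illustration of this effect (my own sketch, using a linear recurrence h_t = W h_{t-1} so that ∂h_t/∂h_k is exactly W^(t−k)):

```python
import numpy as np

rng = np.random.default_rng(0)
D_h, steps = 50, 30

def jacobian_product_norm(spectral_norm):
    A = rng.normal(size=(D_h, D_h))
    W = (A + A.T) / 2                          # symmetric, so its norm equals its largest |eigenvalue|
    W *= spectral_norm / np.linalg.norm(W, 2)  # rescale to the desired norm
    return np.linalg.norm(np.linalg.matrix_power(W, steps), 2)

print(jacobian_product_norm(0.9))   # ~0.9**30 ~ 0.04 -> vanishing
print(jacobian_product_norm(1.1))   # ~1.1**30 ~ 17   -> exploding
```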

22 Why is the vanishing gradient a problem? The error at a time step ideally should be able to tell a previous time step, many steps away, to change during backprop. [Figure: the unrolled RNN with the shared matrix W.]

23 IPython Notebook with vanishing gradient example Example of a simple and clean NNet implementation; comparison of sigmoid and ReLU units; a little bit of vanishing gradient.

24 RNN Python Example The model: s_t = tanh(U x_t + W s_{t-1}), o_t = softmax(V s_t). Our loss: cross-entropy over the predicted word distributions. [Figure: the unrolled network with input weights U, recurrent weights W, output weights V, and outputs o_t.]

25 RNN Python Example: Training Here's an actual training example from our text:

26 RNN Python Example: Parameter Init Initialize the weights randomly in the interval [−1/√n, 1/√n], where n is the number of incoming connections from the previous layer. Forward propagation (predicting word probabilities): each o_t is a vector of probabilities representing the words in our vocabulary, but sometimes, for example when evaluating our model, all we want is the next word with the highest probability.
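A hedged sketch of the initialization and forward pass described here, written as a small numpy class (class and method names are my own; U is input-to-hidden, V hidden-to-output, W hidden-to-hidden):

```python
import numpy as np

class RNNNumpy:
    def __init__(self, word_dim, hidden_dim=100, bptt_truncate=4):
        self.word_dim, self.hidden_dim, self.bptt_truncate = word_dim, hidden_dim, bptt_truncate
        # Random init in [-1/sqrt(n), 1/sqrt(n)], where n = number of incoming connections
        self.U = np.random.uniform(-1/np.sqrt(word_dim), 1/np.sqrt(word_dim), (hidden_dim, word_dim))
        self.V = np.random.uniform(-1/np.sqrt(hidden_dim), 1/np.sqrt(hidden_dim), (word_dim, hidden_dim))
        self.W = np.random.uniform(-1/np.sqrt(hidden_dim), 1/np.sqrt(hidden_dim), (hidden_dim, hidden_dim))

    def forward_propagation(self, x):
        """x is a list of word indices; returns per-step word distributions o and hidden states s."""
        T = len(x)
        s = np.zeros((T + 1, self.hidden_dim))   # s[-1] is the initial (zero) hidden state
        o = np.zeros((T, self.word_dim))
        for t in range(T):
            s[t] = np.tanh(self.U[:, x[t]] + self.W @ s[t - 1])   # one-hot input = column lookup
            z = self.V @ s[t]
            o[t] = np.exp(z - z.max()) / np.sum(np.exp(z - z.max()))
        return o, s

    def predict(self, x):
        """Sometimes all we want is the highest-probability next word at each step."""
        o, _ = self.forward_propagation(x)
        return np.argmax(o, axis=1)
```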

27 RNN Python Example: Parameter Init For each word in the sentence (in our case: 45), our model made 8000 predictions representing probabilities of the next word (currently random, as U, V, W are all random).

28 RNN Python Example: Loss Calculation For random init: we have C words in our vocabulary, so each word should be (on average) predicted with probability 1/C, which would yield a loss of ln(C) ≈ 8.99 for C = 8000.
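Sanity check of that baseline loss:

```python
import numpy as np

C = 8000
print(np.log(C))   # ~8.987: the expected cross-entropy when every word gets probability 1/C
```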

29 RNN Python Example: Training the RNN with SGD and Backpropagation Through Time (BPTT) The model and loss are as before: s_t = tanh(U x_t + W s_{t-1}), o_t = softmax(V s_t), with the cross-entropy loss.

30 Training the RNN with SGD and Backpropagation Through Time (BPTT): Calculate the Gradient Find the parameters U, V, W that minimize the total loss L on the training data using SGD. BPTT takes as input a training example (x, y) and outputs the gradients ∂L/∂U, ∂L/∂V, ∂L/∂W.
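A hedged sketch of BPTT for the RNNNumpy sketch above (derived by hand, so it may differ in detail from the code on the slide): given one training example (x, y), it returns the gradients of the cross-entropy loss with respect to U, V and W, truncating the backward pass at bptt_truncate steps.

```python
import numpy as np

def bptt(model, x, y):
    T = len(y)
    o, s = model.forward_propagation(x)
    dLdU, dLdV, dLdW = np.zeros_like(model.U), np.zeros_like(model.V), np.zeros_like(model.W)
    delta_o = o.copy()
    delta_o[np.arange(T), y] -= 1.0                            # gradient of softmax + cross-entropy
    for t in reversed(range(T)):
        dLdV += np.outer(delta_o[t], s[t])
        delta_t = (model.V.T @ delta_o[t]) * (1 - s[t] ** 2)   # back through tanh
        for step in range(t, max(0, t - model.bptt_truncate) - 1, -1):   # truncated backprop through time
            dLdW += np.outer(delta_t, s[step - 1])
            dLdU[:, x[step]] += delta_t
            delta_t = (model.W.T @ delta_t) * (1 - s[step - 1] ** 2)
    return dLdU, dLdV, dLdW
```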

31 Training the RNN with SGD and Backpropagation Through Time (BPTT): Gradient Checking
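A hedged sketch of numerical gradient checking for the BPTT code above: perturb a few random parameters by ±h and compare the centered finite-difference estimate with the analytic gradient. Use a bptt_truncate at least as long as the sentence so the truncated gradient equals the true one.

```python
import numpy as np

def total_loss(model, x, y):
    """Cross-entropy loss of one example under the current parameters."""
    o, _ = model.forward_propagation(x)
    return -np.sum(np.log(o[np.arange(len(y)), y]))

def gradient_check(model, x, y, h=1e-4, tol=1e-3, checks_per_matrix=5):
    analytic = dict(zip("UVW", bptt(model, x, y)))
    for name in "UVW":
        P = getattr(model, name)
        for _ in range(checks_per_matrix):
            idx = np.unravel_index(np.random.randint(P.size), P.shape)
            original = P[idx]
            P[idx] = original + h
            loss_plus = total_loss(model, x, y)
            P[idx] = original - h
            loss_minus = total_loss(model, x, y)
            P[idx] = original                                  # restore the parameter
            numeric = (loss_plus - loss_minus) / (2 * h)
            rel_err = abs(numeric - analytic[name][idx]) / max(1e-8, abs(numeric) + abs(analytic[name][idx]))
            assert rel_err < tol, f"gradient check failed for {name} at {idx}"
```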

32 Training the RNN with SGD and Backpropagation Through Time (BPTT): SGD
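A minimal SGD step built on the BPTT sketch above (the learning rate is an arbitrary choice, not taken from the slides):

```python
def sgd_step(model, x, y, learning_rate=0.005):
    """One stochastic gradient descent update on a single training example."""
    dLdU, dLdV, dLdW = bptt(model, x, y)
    model.U -= learning_rate * dLdU
    model.V -= learning_rate * dLdV
    model.W -= learning_rate * dLdW
```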

33 Training the RNN with SGD and Backpropagation Through Time (BPTT): SGD It seems the loss is decreasing!

34 RNN Python Example: Generate Text
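A hedged sketch of the generation loop: feed the sentence generated so far back into the model and sample the next word from o_t until an end-of-sentence token appears (the start/end token indices and the index_to_word mapping are assumptions):

```python
import numpy as np

def generate_sentence(model, start_idx, end_idx, index_to_word, max_len=50):
    sentence = [start_idx]
    while sentence[-1] != end_idx and len(sentence) < max_len:
        o, _ = model.forward_propagation(sentence)
        next_word = np.random.choice(len(o[-1]), p=o[-1])   # sample from the predicted distribution
        sentence.append(next_word)
    return [index_to_word[i] for i in sentence[1:-1]]       # drop the start/end tokens
```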

35 RNN Python Example: Generate Text A few selected (censored) sentences, with capitalization added, from the model trained on reddit comments: 1. Anyway, to the city scene you're an idiot teenager. 2. What?!!!! ignore! 3. Screw fitness, you're saying: hops 4. Thanks for the advice to keep my thoughts around girls. 5. Yep, please disappear with the terrible generation. What went right? The model successfully learns syntax: it correctly places commas, ends sentences with punctuation, uses multiple exclamation marks, etc. What went wrong? The sentences don't make sense or have grammatical errors! Why? Our vanilla RNN can't generate meaningful text because it's unable to learn dependencies between words that are several steps apart.

36

37 Trick for exploding gradient: clipping trick The solution, first introduced by Mikolov, is to clip gradients to a maximum value. Makes a big difference in RNNs.
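A minimal sketch of norm clipping in the style of Pascanu et al. (Mikolov's original trick instead clips each component to a fixed range, e.g. np.clip(grad, -threshold, threshold)):

```python
import numpy as np

def clip_gradient(grad, threshold=5.0):
    """If the gradient norm exceeds the threshold, rescale the gradient to that norm."""
    norm = np.linalg.norm(grad)
    if norm > threshold:
        grad = grad * (threshold / norm)
    return grad
```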

38 Gradient clipping intuition Figure from the paper "On the difficulty of training Recurrent Neural Networks", Pascanu et al. 2013: the error surface of a single-unit RNN run through a small number of time steps, with high-curvature walls. Solid lines: standard gradient descent trajectories. Dashed lines: gradients rescaled to a fixed size.

39 For vanishing gradients: Initialization + ReLUs! Initialize the W^(*) matrices to the identity matrix I and use f(z) = rect(z) = max(z, 0). [Figure: test accuracy vs. training steps on pixel-by-pixel permuted MNIST for LSTM, RNN + Tanh, and RNN + ReLUs (IRNN).] → Huge difference! The initialization idea was first introduced in "Parsing with Compositional Vector Grammars", Socher et al. 2013; new experiments with recurrent neural nets appeared a year ago (!) in "A Simple Way to Initialize Recurrent Networks of Rectified Linear Units", Le et al. 2015.
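A small sketch of that recipe, identity-initialized recurrent weights plus ReLU activations (the input-weight scale is an arbitrary choice):

```python
import numpy as np

def init_irnn(hidden_dim, input_dim, input_scale=0.001):
    W_hh = np.eye(hidden_dim)                                  # recurrent weights = identity matrix I
    W_hx = np.random.normal(scale=input_scale, size=(hidden_dim, input_dim))
    return W_hh, W_hx

def irnn_step(x_t, h_prev, W_hh, W_hx):
    return np.maximum(0.0, W_hh @ h_prev + W_hx @ x_t)         # f(z) = rect(z) = max(z, 0)
```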

40 Perplexity Results KN5 = count-based language model with Kneser-Ney smoothing & 5-grams. Perplexity: the lower the better. Table from the paper "Extensions of recurrent neural network language model" by Mikolov et al. 2011.

41 Problem: Softmax is huge and slow Trick: class-based word prediction: p(w_t | history) = p(c_t | history) p(w_t | c_t) = p(c_t | h_t) p(w_t | c_t). The more classes, the better the perplexity but also the worse the speed: if the number of classes is as large as your vocabulary, you don't gain anything.
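A hedged sketch of the factorization (the data layout, a word-to-class map plus per-class word lists, is my own choice): each step scores the classes plus only the words inside the target word's class, instead of the full vocabulary.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def class_based_probability(h_t, word, word_to_class, class_words, W_class, W_word):
    """p(word | h_t) = p(class | h_t) * p(word | class, h_t)."""
    c = word_to_class[word]
    p_class = softmax(W_class @ h_t)                    # distribution over classes
    members = class_words[c]                            # word indices belonging to class c
    p_word_within = softmax(W_word[members] @ h_t)      # distribution over words inside the class
    return p_class[c] * p_word_within[members.index(word)]
```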

42

43 Sequence modeling for other tasks Classify each word: NER, the sentiment of each word in its context, opinionated expressions. Example application and slides from the paper "Opinion Mining with Deep Recurrent Nets" by Irsoy and Cardie 2014.

44 Opinion Mining with Deep Recurrent Nets Goal: classify each word as direct subjective expressions (DSEs) and expressive subjective expressions (ESEs). DSE: explicit mentions of private states or speech events expressing private states. ESE: expressions that indicate sentiment, emotion, etc. without explicitly conveying them.

45 Example Annotation In BIO notation (tags are either begin-of-entity (B_X) or continuation-of-entity (I_X)): The committee, [as usual]_ESE, [has refused to make any statements]_DSE.

46 Approach: Recurrent Neural Network Notation from the paper (so you get used to different ones): h_t = f(W x_t + V h_{t-1} + b), y_t = g(U h_t + c). x represents a token (word) as a vector. y represents the output label (B, I or O); g = softmax! h is the memory, computed from the past memory and the current word; it summarizes the sentence up to that time.
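One step of this network in numpy, as a sketch (f taken to be tanh, g the softmax over the label set):

```python
import numpy as np

def tagger_step(x_t, h_prev, W, V, b, U, c):
    """h_t = f(W x_t + V h_{t-1} + b); y_t = g(U h_t + c)."""
    h_t = np.tanh(W @ x_t + V @ h_prev + b)
    scores = U @ h_t + c
    e = np.exp(scores - scores.max())
    y_t = e / e.sum()            # probabilities over the labels, e.g. {B_DSE, I_DSE, B_ESE, I_ESE, O}
    return h_t, y_t
```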

47 Bidirectional RNNs Problem: for classification you want to incorporate information from words both preceding and following. Ideas? [Figure: a bidirectional RNN layer; h now represents (summarizes) the past and future around a single token.]
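A minimal sketch of the idea: run one RNN left-to-right and another right-to-left over the token vectors, then represent each token by the concatenation of the two hidden states (function names and the parameter-dictionary layout are my own):

```python
import numpy as np

def rnn_step(x_t, h_prev, p):
    return np.tanh(p["W"] @ x_t + p["V"] @ h_prev + p["b"])

def bidirectional_states(xs, params_fwd, params_bwd):
    """xs: list of token vectors; returns one concatenated [forward; backward] state per token."""
    T, D_h = len(xs), params_fwd["V"].shape[0]
    h_fwd = [np.zeros(D_h)]
    for t in range(T):                                  # left-to-right pass
        h_fwd.append(rnn_step(xs[t], h_fwd[-1], params_fwd))
    h_bwd = [np.zeros(D_h)]
    for t in reversed(range(T)):                        # right-to-left pass
        h_bwd.append(rnn_step(xs[t], h_bwd[-1], params_bwd))
    h_bwd = h_bwd[1:][::-1]                             # align backward states with token positions
    return [np.concatenate([f, b]) for f, b in zip(h_fwd[1:], h_bwd)]
```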

48 Deep Bidirectional RNNs [Figure: stacked memory layers h^(1), h^(2), h^(3) between input x and output y.] Each memory layer passes an intermediate sequential representation to the next.

49 Data MPQA 1.2 corpus (Wiebe et al., 2005): 535 news articles (11,111 sentences) manually labeled at the phrase level (with DSEs and ESEs). Evaluation: F1.

50 Evaluation [Figure: F1 vs. number of layers, for two training-set sizes (24k and 200k).] Notice that after several layers the results start to get worse (overfitting?).

51 Summary Recurrent Neural Networks are powerful. Next up: Fancy Recursive Neural Networks! (advanced stuff)
