1 Recurrent Neural Networks Dr. Kira Radinsky, CTO SalesPredict, Visiting Professor/Scientist, Technion. Slides were adapted from lectures by Richard Socher
2 Overview Traditional language models RNNs RNN language models Important training problems and tricks Vanishing and exploding gradient problems Bidirectional RNNs RNNs for other sequence tasks
3 Language Models A language model computes a probability for a sequence of words: P(w_1, ..., w_T). Useful for machine translation Word ordering: p(the cat is small) > p(small the is cat) Word choice: p(walking home after school) > p(walking house after school)
4 Traditional Language Models Probability is usually conditioned on a window of n previous words An incorrect but necessary Markov assumption! To estimate probabilities, compute counts for unigrams and bigrams (conditioning on one/two previous word(s)): p(w2 | w1) = count(w1, w2) / count(w1), and p(w3 | w1, w2) = count(w1, w2, w3) / count(w1, w2)
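For instance, bigram probabilities can be estimated from counts as in the following minimal sketch (the toy corpus is made up for illustration):

```python
from collections import Counter

corpus = "the cat is small the cat is cute".split()   # toy corpus, an assumption
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def p_bigram(w2, w1):
    """p(w2 | w1) = count(w1, w2) / count(w1)"""
    return bigrams[(w1, w2)] / unigrams[w1]

print(p_bigram("cat", "the"))   # 1.0: every "the" in this toy corpus is followed by "cat"
```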
5 Traditional Language Models Performance improves with keeping around higher n-gram counts and doing smoothing and so-called backoff (e.g. if a 4-gram is not found, try the 3-gram, etc.) There are A LOT of n-grams! → Gigantic RAM requirements! Recent state of the art: Scalable Modified Kneser-Ney Language Model Estimation by Heafield et al.: "Using one machine with 140 GB RAM for 2.8 days, we built an unpruned model on 126 billion tokens"
6 Original neural language model A Neural Probabilistic Language Model, Bengio et al. 2003. i-th output = P(w_t = i | context). Original equations: y = b + Wx + U tanh(d + Hx), with most computation in the tanh hidden layer, and P̂(w_t | w_{t-1}, ..., w_{t-n+1}) = e^{y_{w_t}} / Σ_i e^{y_i} (softmax). Problem: fixed window of context for conditioning :( [Architecture figure: each input word index is looked up in a shared embedding matrix C, giving C(w_{t-n+1}), ..., C(w_{t-2}), C(w_{t-1}); parameters are shared across words.]
7 Recurrent Neural Networks! RNNs tie the weights at each time step Condition the neural network on all previous words RAM requirement only scales with the number of words [Figure: a Recurrent Neural Network (RNN) unrolled over three time-steps, with inputs x_{t-1}, x_t, x_{t+1}, hidden states h_{t-1}, h_t, h_{t+1} connected by the shared matrix W, and outputs y_{t-1}, y_t, y_{t+1}.]
8 Recurrent Neural Network language model Given a list of word vectors x_1, ..., x_T, at a single time step: h_t = σ(W^(hh) h_{t-1} + W^(hx) x_t), ŷ_t = softmax(W^(S) h_t). The prediction output at timestamp t is computed from the output features h_t, which combine the output of the previous step, h_{t-1}, with the next word vector in the document, x_t.
9 Recurrent Neural Network language model Main idea: we use the same set of W weights at all time steps! Everything else is the same: h_{t-1} is the output of the non-linear function at the previous time-step, t-1 (the document context score so far); x_t is the column vector of the embedding matrix L at index [t] at time step t; W^(hh) is the weights matrix used to condition the output of the previous time-step, h_{t-1}; W^(hx) is the weights matrix used to condition the input word vector, x_t; V is the vocabulary.
10 Recurrent Neural Network language model ŷ_t is a probability distribution over the vocabulary: the distribution at each timestamp t. Essentially, ŷ_t is the prediction of the next word given the document context score so far (i.e. h_{t-1}) and the last observed word vector x_t.
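A minimal NumPy sketch of one time step in this notation (a sketch, not the lecture's code; the shapes are assumptions: W_hh is H×H, W_hx is H×d, W_S is |V|×H):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - np.max(z))   # shift for numerical stability
    return e / e.sum()

def rnn_lm_step(h_prev, x_t, W_hh, W_hx, W_S):
    """One time step: condition on the previous hidden state and the current word."""
    h_t = sigmoid(W_hh @ h_prev + W_hx @ x_t)
    y_hat = softmax(W_S @ h_t)   # probability distribution over the vocabulary
    return h_t, y_hat
```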
11 RNN Example Vocabulary size C = 8000 A hidden layer size H = 100 U, V and W are the parameters of our network that we want to learn from data. Thus, we need to learn a total of 2HC + H² = 2·100·8000 + 100² = 1,610,000 parameters.
12 Recurrent Neural Network language model Same cross-entropy loss function, but predicting words instead of classes: J^(t)(θ) = -Σ_{j=1}^{|V|} y_{t,j} log ŷ_{t,j}
13 Recurrent Neural Network language model Evaluation could just be the negative of the average log probability over a dataset of size (number of words) T: J = -(1/T) Σ_{t=1}^{T} Σ_{j=1}^{|V|} y_{t,j} log ŷ_{t,j}. But more common: Perplexity = 2^J. Lower is better!
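A quick sketch of this evaluation, assuming we already have the model's probability for each observed next word (the numbers are made up for illustration):

```python
import numpy as np

# probs[t] = probability the model assigned to the word actually observed at position t
probs = np.array([0.2, 0.05, 0.1, 0.4])   # made-up values
J = -np.mean(np.log2(probs))              # negative average log2 probability
perplexity = 2 ** J                       # lower is better
print(J, perplexity)                      # ~2.82, ~7.1
```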
14 Training RNNs is hard Multiply the same matrix at each time step during forward prop [Figure: the unrolled RNN with shared W across x_{t-1}, x_t, x_{t+1} and h_{t-1}, h_t, h_{t+1}.] Ideally, inputs from many time steps ago can modify output y. Take for an example an RNN with 2 time steps! Insightful!
15 The vanishing/exploding gradient problem The RNN goal is to enable propagating context information through faraway time-steps. Recurrent neural networks propagate the same W matrix from one time-step to the next. [Figure: the same unrolled RNN diagram.]
16 The vanishing/exploding gradient problem Will the RNN succeed in the same way in the following sentences? For long sentences, the probability that "John" would be predicted as the next word decreases with the size of the context.
17 The vanishing gradient problem - Details Similar but simpler RNN formulation: h_t = W f(h_{t-1}) + W^(hx) x_t, ŷ_t = W^(S) f(h_t). Total error is the sum of the errors at each time step t: E = Σ_t E_t. Hardcore chain rule application: ∂E/∂W = Σ_{t=1}^{T} ∂E_t/∂W
18 The vanishing gradient problem - Details Similar to backprop but a less efficient formulation. Useful for the analysis we'll look at: ∂E_t/∂W = Σ_{k=1}^{t} (∂E_t/∂y_t)(∂y_t/∂h_t)(∂h_t/∂h_k)(∂h_k/∂W). Remember: h_t = W f(h_{t-1}) + W^(hx) x_t. More chain rule, remember: ∂h_t/∂h_k = Π_{j=k+1}^{t} ∂h_j/∂h_{j-1}. Each partial is a Jacobian.
19 Jacobian - Reminder For a function f: R^n → R^m, the Jacobian is the m×n matrix of partial derivatives, with entry (i, j) equal to ∂f_i/∂x_j.
20 The vanishing gradient problem - Details From the previous slide: ∂h_t/∂h_k = Π_{j=k+1}^{t} ∂h_j/∂h_{j-1}. Remember: h_j = W f(h_{j-1}) + W^(hx) x_j. To compute the Jacobian, differentiate each element of the matrix: ∂h_j/∂h_{j-1} = W^T diag[f'(h_{j-1})], where diag(z) is the diagonal matrix with vector z on its diagonal. Check at home that you understand the diag matrix formulation.
21 The vanishing gradient problem - Details Analyzing the norms of the Jacobians yields: ‖∂h_j/∂h_{j-1}‖ ≤ ‖W^T‖ ‖diag[f'(h_{j-1})]‖ ≤ β_W β_h, where we defined the β's as upper bounds of the norms. The gradient is a product of Jacobian matrices, each associated with a step in the forward computation: ‖∂h_t/∂h_k‖ = ‖Π_{j=k+1}^{t} ∂h_j/∂h_{j-1}‖ ≤ (β_W β_h)^{t-k}. This can become very small or very large quickly [Bengio et al. 1994], and the locality assumption of gradient descent breaks down. → Vanishing or exploding gradient
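A small numeric illustration of the effect (a sketch; the 0.9 / 1.1 spectral radii are arbitrary choices, and the diag[f'(h)] factors, which for sigmoid or tanh can only shrink the product further, are omitted):

```python
import numpy as np

rng = np.random.default_rng(0)
H = 50
for radius in (0.9, 1.1):   # spectral radius below / above 1
    Q = rng.standard_normal((H, H))
    W = radius * Q / np.max(np.abs(np.linalg.eigvals(Q)))   # rescale to the target radius
    prod = np.identity(H)
    for t in range(1, 101):
        prod = W @ prod      # one more Jacobian-like factor in the product
        if t in (1, 25, 50, 100):
            print(f"radius={radius} t={t:3d} ||product||={np.linalg.norm(prod, 2):.3e}")
# with radius 0.9 the norm decays toward 0; with radius 1.1 it blows up
```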
22 Why is the vanishing gradient a problem? The error at a time step ideally should be able to tell a previous time step from many steps away to change during backprop. [Figure: the unrolled RNN diagram, with the error signal flowing back through many steps.]
23 IPython Notebook with vanishing gradient example Example of a simple and clean NNet implementation Comparison of sigmoid and ReLU units A little bit of vanishing gradient
24 RNN Python Example The model: s_t = tanh(U x_t + W s_{t-1}), o_t = softmax(V s_t). Our loss: cross-entropy, L(y, o) = -(1/N) Σ_n y_n log o_n. [Figure: the network unrolled over time, with parameters U, V, W shared across steps.]
25 RNN Python Example: Training Here's an actual training example from our text:
26 RNN Python Example: Parameter Init Initialize the weights randomly in the interval [-1/√n, 1/√n], where n is the number of incoming connections from the previous layer. Forward propagation (predicting word probabilities): each o_t is a vector of probabilities representing the words in our vocabulary, but sometimes, for example when evaluating our model, all we want is the next word with the highest probability.
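A minimal NumPy sketch of this initialization and forward pass, in the spirit of the example (variable names and the random seed are assumptions):

```python
import numpy as np

C, H = 8000, 100   # vocabulary and hidden-layer sizes from the example
rng = np.random.default_rng(0)

# weights drawn uniformly from [-1/sqrt(n), 1/sqrt(n)], n = incoming connections
U = rng.uniform(-1/np.sqrt(C), 1/np.sqrt(C), (H, C))
W = rng.uniform(-1/np.sqrt(H), 1/np.sqrt(H), (H, H))
V = rng.uniform(-1/np.sqrt(H), 1/np.sqrt(H), (C, H))

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def forward(x, U, V, W):
    """Forward propagation over a sentence x given as a list of word indices."""
    T, H, C = len(x), W.shape[0], V.shape[0]
    s = np.zeros((T + 1, H))   # s[-1] doubles as the initial zero state
    o = np.zeros((T, C))
    for t in range(T):
        s[t] = np.tanh(U[:, x[t]] + W @ s[t - 1])
        o[t] = softmax(V @ s[t])   # o[t]: distribution over the vocabulary
    return o, s

def predict(x, U, V, W):
    """When all we want is the highest-probability next word at each position."""
    o, _ = forward(x, U, V, W)
    return np.argmax(o, axis=1)
```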
27 RNN Python Example: Parameter Init For each word in the sentence (in our case: 45), our model made 8000 predictions representing probabilities of the next word (currently random, as U, V, W are all random).
28 RNN Python Example: Loss Calculation For random init: we have C words in our vocabulary, so each word should be (on average) predicted with probability 1/C, which would yield a loss of L = -log(1/C) = log C ≈ 8.99 for C = 8000 (natural log).
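This baseline is a one-liner to check:

```python
import numpy as np

C = 8000
print(np.log(C))   # ≈ 8.987: the expected loss right after random initialization
```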
29 RNN Python Example: Training the RNN with SGD and Backpropagation Through Time (BPTT) The model: s_t = tanh(U x_t + W s_{t-1}), o_t = softmax(V s_t). Our loss: cross-entropy over the training words. [Figure: the unrolled network with shared U, V, W.]
30 Training the RNN with SGD and Backpropagation Through Time (BPTT): Calculate the Gradient Find the parameters U, V, W that minimize the total loss L on the training data using SGD. BPTT takes as input a training example (x, y) and outputs the gradients ∂L/∂U, ∂L/∂V, ∂L/∂W.
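A minimal sketch of truncated BPTT for the model above, reusing forward() from the earlier sketch (the truncation window of 4 steps is an assumption):

```python
def bptt(x, y, U, V, W, truncate=4):
    """Gradients of the summed cross-entropy loss w.r.t. U, V, W.
    x, y are lists of word indices; y is x shifted by one word."""
    T = len(x)
    o, s = forward(x, U, V, W)
    dU, dV, dW = np.zeros_like(U), np.zeros_like(V), np.zeros_like(W)
    delta_o = o.copy()
    delta_o[np.arange(T), y] -= 1.0   # dL/dz for softmax + cross-entropy
    for t in reversed(range(T)):
        dV += np.outer(delta_o[t], s[t])
        delta = V.T @ delta_o[t] * (1 - s[t] ** 2)   # backprop through tanh at step t
        # push the error back through at most `truncate` earlier time steps
        for step in range(t, max(0, t - truncate) - 1, -1):
            dW += np.outer(delta, s[step - 1])
            dU[:, x[step]] += delta
            delta = W.T @ delta * (1 - s[step - 1] ** 2)
    return dU, dV, dW
```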
31 Training the RNN with SGD and Backpropagation Through Time (BPTT): Gradient Checking Verify the analytic gradients by comparing them with numerical gradients, (L(θ + ε) - L(θ - ε)) / 2ε, for each parameter.
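A sketch of such a check against the bptt() above; it costs two loss evaluations per parameter, so run it only with a tiny vocabulary and hidden size:

```python
def gradient_check(x, y, U, V, W, eps=1e-4, tol=1e-3):
    """Compare analytic BPTT gradients against centered finite differences."""
    def total_loss(U, V, W):
        o, _ = forward(x, U, V, W)
        return -np.sum(np.log(o[np.arange(len(y)), y]))
    grads = bptt(x, y, U, V, W, truncate=len(x))   # no truncation: exact gradients
    for name, P, dP in zip("UVW", (U, V, W), grads):
        it = np.nditer(P, flags=["multi_index"])
        while not it.finished:
            ix = it.multi_index
            saved = P[ix]
            P[ix] = saved + eps; plus = total_loss(U, V, W)
            P[ix] = saved - eps; minus = total_loss(U, V, W)
            P[ix] = saved
            numeric = (plus - minus) / (2 * eps)
            assert abs(numeric - dP[ix]) < tol, f"{name}{ix}: {numeric} vs {dP[ix]}"
            it.iternext()
    print("gradient check passed")
```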
32 Training the RNN with SGD and Backpropagation Through Time (BPTT): SGD
33 Training the RNN with SGD and Backpropagation Through Time (BPTT): SGD It seems the loss is decreasing!
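A sketch of the SGD update and loop (the learning rate and the training_data iterable are assumptions):

```python
def sgd_step(x, y, U, V, W, lr=0.005):
    """One SGD update on a single (x, y) training example."""
    dU, dV, dW = bptt(x, y, U, V, W)
    U -= lr * dU
    V -= lr * dV
    W -= lr * dW

# hypothetical loop: training_data yields (x, y) index lists, y shifted by one word
# for epoch in range(num_epochs):
#     for x, y in training_data:
#         sgd_step(x, y, U, V, W)
```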
34 RNN Python Example: Generate Text
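A sketch of generation by sampling, feeding each sampled word back in (start_idx / end_idx stand for assumed SENTENCE_START / SENTENCE_END token indices):

```python
def generate(U, V, W, start_idx, end_idx, max_len=50):
    """Sample a sentence token by token from the trained model."""
    H = W.shape[0]
    sent, h = [start_idx], np.zeros(H)
    while sent[-1] != end_idx and len(sent) < max_len:
        h = np.tanh(U[:, sent[-1]] + W @ h)   # advance the hidden state
        probs = softmax(V @ h)                # next-word distribution
        sent.append(int(np.random.choice(len(probs), p=probs)))
    return sent
```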
35 RNN Python Example: Generate Text A few selected (censored) sentences, with capitalization added, for reddit comments: 1. Anyway, to the city scene you're an idiot teenager. 2. What?!!!! ignore! 3. Screw fitness, you're saying: hops 4. Thanks for the advice to keep my thoughts around girls. 5. Yep, please disappear with the terrible generation. What went right? Successfully learned syntax: correctly places commas, ends sentences with punctuation, multiple exclamation marks, etc. What went wrong? Sentences don't make sense or have grammatical errors! Why? Our vanilla RNN can't generate meaningful text because it's unable to learn dependencies between words that are several steps apart.
37 Trick for exploding gradient: clipping trick The solution, first introduced by Mikolov, is to clip gradients to a maximum value: if ‖g‖ ≥ threshold, rescale g ← (threshold / ‖g‖) · g. Makes a big difference in RNNs.
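A sketch of norm-based clipping in that style (the threshold of 5 is an arbitrary assumption):

```python
import numpy as np

def clip_gradients(grads, threshold=5.0):
    """Rescale any gradient whose L2 norm exceeds the threshold."""
    clipped = []
    for g in grads:
        norm = np.linalg.norm(g)
        clipped.append(g * (threshold / norm) if norm > threshold else g)
    return clipped

# e.g. dU, dV, dW = clip_gradients(bptt(x, y, U, V, W))
```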
38 Gradient clipping intuition Figure from the paper "On the difficulty of training Recurrent Neural Networks", Pascanu et al. 2013. Error surface of a single-unit RNN running through a small number of time-steps. High-curvature walls. Solid lines: standard gradient descent trajectories. Dashed lines: gradients rescaled to a fixed size.
39 For vanishing gradients: Initialization + ReLUs! Initialize the W^(*)'s to the identity matrix I and use f(z) = rect(z) = max(z, 0). [Plot: test accuracy vs. training steps on pixel-by-pixel permuted MNIST for LSTM, RNN + tanh, RNN + ReLUs, and IRNN → huge difference!] Initialization idea first introduced in "Parsing with Compositional Vector Grammars", Socher et al. 2013. New experiments with recurrent neural nets a year ago (!) in "A Simple Way to Initialize Recurrent Networks of Rectified Linear Units", Le et al. 2015.
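A sketch of this IRNN-style setup (the small input-weight scale is an assumption):

```python
import numpy as np

def init_irnn(C, H):
    """Le et al. 2015 style: recurrent weights start as the identity matrix."""
    W = np.identity(H)                  # W initialized to I
    U = np.random.randn(H, C) * 0.001   # small random input weights
    return U, W

def relu(z):
    return np.maximum(z, 0)             # f(z) = rect(z) = max(z, 0)

# the hidden update becomes: s[t] = relu(U[:, x[t]] + W @ s[t - 1])
```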
40 Perplexity Results KN5 = count-based language model with Kneser-Ney smoothing & 5-grams. Perplexity: the lower the better. [Table from the paper "Extensions of recurrent neural network language model" by Mikolov et al. 2011.]
41 Problem: Softmax is huge and slow Trick: class-based word prediction p(w_t | history) = p(c_t | history) · p(w_t | c_t) = p(c_t | h_t) · p(w_t | c_t) The more classes, the better the perplexity, but also the worse the speed: if the number of classes equals your vocabulary size, you don't gain anything.
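A sketch of the factored probability (shapes and the class lookup tables are assumptions: Wc scores the classes, Ww scores the words):

```python
def class_factored_prob(h, Wc, Ww, word, word2class, class2words):
    """p(w_t | h_t) = p(c_t | h_t) * p(w_t | c_t, h_t)."""
    c = word2class[word]
    p_class = softmax(Wc @ h)[c]             # softmax over classes only
    members = class2words[c]                 # word indices belonging to class c
    p_word = softmax(Ww[members] @ h)[members.index(word)]   # softmax within the class
    return p_class * p_word
```

With roughly √|V| classes of about √|V| words each, both softmaxes run over about √|V| items instead of |V|, which is where the speedup comes from.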
43 Sequence modeling for other tasks Classify: NER, the sentiment of each word in its context, opinionated expressions. Example application and slides from the paper "Opinion Mining with Deep Recurrent Nets" by Irsoy and Cardie 2014.
44 Opinion Mining with Deep Recurrent Nets Goal: classify each word as a direct subjective expression (DSE) or an expressive subjective expression (ESE). DSE: explicit mentions of private states or speech events expressing private states. ESE: expressions that indicate sentiment, emotion, etc. without explicitly conveying them.
45 Example Annotation In BIO notation (tags are either begin-of-entity (B_X) or continuation-of-entity (I_X)): The committee, [as usual] ESE, [has refused to make any statements] DSE.
46 Approach: Recurrent Neural Network Notation from the paper (so you get used to different ones): h_t = f(W x_t + V h_{t-1} + b), y_t = g(U h_t + c). x represents a token (word) as a vector. y represents the output label (B, I or O); g = softmax! h is the memory, computed from the past memory and the current word. It summarizes the sentence up to that time.
47 Bidirectional RNNs Problem: for classification you want to incorporate information from words both preceding and following. Ideas? h now represents (summarizes) the past and the future around a single token, as in the sketch below.
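A sketch of a bidirectional pass over a sequence of token vectors (shapes are assumed, and tanh stands in for the generic f):

```python
import numpy as np

def bidirectional_rnn(xs, Wf, Vf, bf, Wb, Vb, bb):
    """Run forward and backward RNNs and concatenate the two states per token."""
    T, H = len(xs), Vf.shape[0]
    h_fwd, h_bwd = np.zeros((T, H)), np.zeros((T, H))
    h = np.zeros(H)
    for t in range(T):                 # left-to-right: summarizes the past
        h = np.tanh(Wf @ xs[t] + Vf @ h + bf)
        h_fwd[t] = h
    h = np.zeros(H)
    for t in reversed(range(T)):       # right-to-left: summarizes the future
        h = np.tanh(Wb @ xs[t] + Vb @ h + bb)
        h_bwd[t] = h
    return np.concatenate([h_fwd, h_bwd], axis=1)   # (T, 2H) per-token features
```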
48 Deep Bidirectional RNNs [Diagram: stacked bidirectional layers h^(1), h^(2), h^(3) between input x and output y.] Each memory layer passes an intermediate sequential representation to the next.
49 Data MPQA 1.2 corpus (Wiebe et al., 2005): 535 news articles (11,111 sentences), manually labeled at the phrase level (with DSEs and ESEs). Evaluation: F1.
50 Evaluation [Plot: binary F1 vs. number of layers, for 24k and 200k training tokens.] Notice that after several layers the results start to get worse (overfitting?).
51 Summary Recurrent Neural Networks are powerful Next up: Fancy Recursive Neural Networks! (advanced stuff)