Natural Language Processing


1 Natural Language Processing Language models Based on slides from Michael Collins, Chris Manning and Richard Socher

2 Plan: problem definition, trigram models, evaluation, estimation, interpolation, discounting

3 Motivations Define a probability distribution over sentences. Why? (e.g., $p(\text{text} \mid 01010) \propto p(01010 \mid \text{text})\, p(\text{text})$.) Machine translation: P(high winds) > P(large winds). Spelling correction: P(The office is fifteen minutes from here) > P(The office is fifteen minuets from here). Speech recognition (that's where it started!): P(recognize speech) > P(wreck a nice beach). And more!

4 Motivations These techniques will be useful later

5 Problem definition Given a finite vocabulary V = {the, a, man, telescope, Beckham, two, ...}, we have an infinite language L, which is V* concatenated with the special symbol STOP:
the STOP
a STOP
the fan STOP
the fan saw Beckham STOP
the fan saw saw STOP
the fan saw Beckham play for Real Madrid STOP

6 Problem definition Input: a training set of example sentences (currently hundreds of billions of words). Output: a probability distribution p over L such that $\sum_{x \in L} p(x) = 1$ and $p(x) \ge 0$ for all $x \in L$. For example: p(the STOP) = ..., p(the fan saw Beckham STOP) = $2 \times 10^{-8}$, p(the fan saw saw STOP) = ...

7 A naive method Assume we have N training sentences. Let $x_1, x_2, \ldots, x_n$ be a sentence, and $c(x_1, x_2, \ldots, x_n)$ be the number of times it appeared in the training data. Define a language model: $p(x_1, \ldots, x_n) = \frac{c(x_1, \ldots, x_n)}{N}$. No generalization!

8 Markov processes Given a sequence of n random variables $X_1, X_2, \ldots, X_n$ (e.g., n = 100), with $X_i \in V$, we want a sequence probability model $p(X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n)$. There are $|V|^n$ possible sequences.

9 First-order Markov process Chain rule: $p(X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n) = p(X_1 = x_1) \prod_{i=2}^{n} p(X_i = x_i \mid X_1 = x_1, \ldots, X_{i-1} = x_{i-1})$. Markov assumption: $p(X_i = x_i \mid X_1 = x_1, \ldots, X_{i-1} = x_{i-1}) = p(X_i = x_i \mid X_{i-1} = x_{i-1})$.

10 Second-order Markov process Relax the independence assumption: $p(X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n) = p(X_1 = x_1)\, p(X_2 = x_2 \mid X_1 = x_1) \prod_{i=3}^{n} p(X_i = x_i \mid X_{i-2} = x_{i-2}, X_{i-1} = x_{i-1})$. Simplify notation: $x_0 = x_{-1} = *$ (a special start symbol). Is this reasonable?

11 Detail: variable length Probability distribution over sequences of any length. Define always $X_n = \text{STOP}$, and obtain a probability distribution over all sequences. Intuition: at every step you have probability $\alpha_h$ to stop (conditioned on the history) and $1 - \alpha_h$ to keep going. $p(X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n) = \prod_{i=1}^{n} p(X_i = x_i \mid X_{i-2} = x_{i-2}, X_{i-1} = x_{i-1})$

12 Trigram language model A trigram language model contains a vocabulary V and a non-negative parameter $q(w \mid u, v)$ for every trigram, such that $w \in V \cup \{\text{STOP}\}$ and $u, v \in V \cup \{*\}$. The probability of a sentence $x_1, \ldots, x_n$, where $x_n = \text{STOP}$, is $p(x_1, \ldots, x_n) = \prod_{i=1}^{n} q(x_i \mid x_{i-2}, x_{i-1})$.
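A minimal sketch (not from the slides) of how such a model scores a sentence; it assumes the parameters are stored in a Python dict q mapping (u, v, w) to q(w | u, v), with * as the start symbol:

```python
# Minimal sketch: score a sentence under a trigram model.
# Assumes q is a dict mapping (u, v, w) -> q(w | u, v); "*" is the start symbol.

def sentence_prob(words, q):
    """p(x_1 ... x_n) = prod_i q(x_i | x_{i-2}, x_{i-1}), with x_n = STOP."""
    words = list(words) + ["STOP"]
    u, v = "*", "*"                      # x_{-1} = x_0 = *
    prob = 1.0
    for w in words:
        prob *= q.get((u, v, w), 0.0)    # unseen trigrams get probability 0 here
        u, v = v, w
    return prob

# Toy example matching the slide: p(the dog barks STOP)
q = {("*", "*", "the"): 0.5, ("*", "the", "dog"): 0.3,
     ("the", "dog", "barks"): 0.2, ("dog", "barks", "STOP"): 0.9}
print(sentence_prob(["the", "dog", "barks"], q))  # 0.5 * 0.3 * 0.2 * 0.9
```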

13 Example $p(\text{the dog barks STOP}) = q(\text{the} \mid *, *) \cdot q(\text{dog} \mid *, \text{the}) \cdot q(\text{barks} \mid \text{the}, \text{dog}) \cdot q(\text{STOP} \mid \text{dog}, \text{barks})$

14 Limitation The Markov assumption is false: "He is from France, so it makes sense that his first language is ..." We would want to model longer dependencies.

15 Sparseness Maximum likelihood for estimating q: let $c(w_1, \ldots, w_n)$ be the number of times that n-gram appears in a corpus, and set $q(w_i \mid w_{i-2}, w_{i-1}) = \frac{c(w_{i-2}, w_{i-1}, w_i)}{c(w_{i-2}, w_{i-1})}$. If the vocabulary has 20,000 words, the number of parameters is $20{,}000^3 = 8 \times 10^{12}$! Most sentences will have zero or undefined probabilities.
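A small sketch of the maximum-likelihood estimate above, assuming the corpus is given as a list of tokenized sentences (that input format is an assumption, not from the slides):

```python
# Sketch: maximum-likelihood trigram estimates from a toy corpus.
from collections import Counter

def mle_trigram(corpus):
    tri, bi = Counter(), Counter()
    for sent in corpus:
        padded = ["*", "*"] + sent + ["STOP"]
        for i in range(2, len(padded)):
            tri[tuple(padded[i-2:i+1])] += 1
            bi[tuple(padded[i-2:i])] += 1
    # q(w | u, v) = c(u, v, w) / c(u, v); undefined when c(u, v) = 0
    return {(u, v, w): c / bi[(u, v)] for (u, v, w), c in tri.items()}

corpus = [["the", "dog", "barks"], ["the", "dog", "sleeps"]]
q = mle_trigram(corpus)
print(q[("the", "dog", "barks")])  # 0.5
```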

16 Berkeley restaurant project sentences:
can you tell me about any good cantonese restaurants close by
mid priced thai food is what i'm looking for
tell me about chez panisse
can you give me a listing of the kinds of food that are available
i'm looking for a good place to eat breakfast
when is caffe venezia open during the day

17 Bigram counts (table not transcribed)

18 Bigram probabilities (table not transcribed)

19 What did we learn? $p(\text{english} \mid \text{want}) < p(\text{chinese} \mid \text{want})$: people in this corpus ask about Chinese food more. $p(\text{to} \mid \text{want}) = \ldots$: English grammar behaves in a certain way. $p(\text{eat} \mid \text{to}) = \ldots$: English grammar behaves in a certain way.

20 Evaluation: perplexity Test data: $S = \{s_1, s_2, \ldots, s_m\}$. Parameters are not estimated from S. A good language model has high p(S) and low perplexity. Perplexity is the normalized inverse probability of S:
$p(S) = \prod_{i=1}^{m} p(s_i), \qquad \log_2 p(S) = \sum_{i=1}^{m} \log_2 p(s_i)$
$\text{perplexity} = 2^{-l}, \qquad l = \frac{1}{M} \sum_{i=1}^{m} \log_2 p(s_i)$
where M is the number of words in the corpus.
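A sketch of this perplexity computation, assuming a hypothetical sentence_prob function that returns the model probability of a sentence (including STOP):

```python
# Sketch: perplexity of a test set; M counts all words, as on the slide.
import math

def perplexity(test_sentences, sentence_prob):
    log_prob = 0.0
    M = 0
    for sent in test_sentences:
        log_prob += math.log2(sentence_prob(sent))  # assumes p(s) > 0
        M += len(sent) + 1                          # +1 for STOP
    l = log_prob / M
    return 2 ** (-l)

# Example with a uniform "model" over N = |V| + 1 outcomes: perplexity == N
N = 5
uniform = lambda sent: (1.0 / N) ** (len(sent) + 1)
print(perplexity([["a", "b"], ["c"]], uniform))  # 5.0
```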

21 Evaluation: perplexity Say we have a vocabulary V, $N = |V| + 1$, and a trigram model with uniform distribution $q(w \mid u, v) = 1/N$. Then
$l = \frac{1}{M} \sum_{i=1}^{m} \log_2 p(s_i) = \log_2 \frac{1}{N}, \qquad \text{perplexity} = 2^{-l} = 2^{\log_2 N} = N$
Perplexity is the effective vocabulary size.

22 Typical values of perplexity When |V| = 50,000: trigram model perplexity 74 (<< 50,000), bigram model 137, unigram model ...

23 Evaluation Extrinsic evaluations: machine translation, speech recognition, spelling correction, ...

24 History Shannon (1950) estimated the perplexity that humans achieve on printed English (we are good!). Test your own perplexity.

25 Estimating parameters Recall that the number of parameters for a trigram model with |V| = 20,000 is $8 \times 10^{12}$, leading to zeros and undefined probabilities in $q(w_i \mid w_{i-2}, w_{i-1}) = \frac{c(w_{i-2}, w_{i-1}, w_i)}{c(w_{i-2}, w_{i-1})}$.

26 Bias-variance tradeoff Given a corpus of length M:
Trigram model: $q(w_i \mid w_{i-2}, w_{i-1}) = \frac{c(w_{i-2}, w_{i-1}, w_i)}{c(w_{i-2}, w_{i-1})}$
Bigram model: $q(w_i \mid w_{i-1}) = \frac{c(w_{i-1}, w_i)}{c(w_{i-1})}$
Unigram model: $q(w_i) = \frac{c(w_i)}{M}$

28 Linear interpolation $q_{LI}(w_i \mid w_{i-2}, w_{i-1}) = \lambda_1\, q(w_i \mid w_{i-2}, w_{i-1}) + \lambda_2\, q(w_i \mid w_{i-1}) + \lambda_3\, q(w_i)$, with $\lambda_i \ge 0$ and $\sum_i \lambda_i = 1$. Combine the three models to get all the benefits.
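A sketch of the interpolated estimate, assuming the trigram, bigram and unigram ML estimates are stored in dicts q3, q2, q1 (illustrative names) and the lambdas are given:

```python
# Sketch: linearly interpolated trigram estimate.
# Lambdas must be >= 0 and sum to 1; the values below are illustrative.

def q_li(w, u, v, q3, q2, q1, lambdas=(0.5, 0.3, 0.2)):
    l1, l2, l3 = lambdas
    return (l1 * q3.get((u, v, w), 0.0)
            + l2 * q2.get((v, w), 0.0)
            + l3 * q1.get(w, 0.0))

q3 = {("the", "dog", "barks"): 0.5}
q2 = {("dog", "barks"): 0.25}
q1 = {"barks": 0.01}
print(q_li("barks", "the", "dog", q3, q2, q1))  # 0.5*0.5 + 0.3*0.25 + 0.2*0.01
```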

29 Linear interpolation Need to verify that the parameters define a probability distribution:
$\sum_{w \in V} q_{LI}(w \mid u, v) = \sum_{w \in V} \left[ \lambda_1\, q(w \mid u, v) + \lambda_2\, q(w \mid v) + \lambda_3\, q(w) \right]$
$= \lambda_1 \sum_{w \in V} q(w \mid u, v) + \lambda_2 \sum_{w \in V} q(w \mid v) + \lambda_3 \sum_{w \in V} q(w) = \lambda_1 + \lambda_2 + \lambda_3 = 1$

30 Estimating coefficients Use a validation/development set (intro to ML!). Partition the training data into training (90%?) and dev (10%?) data, and optimize the coefficients to minimize the perplexity (the measure we care about!) on the development data.

31 Discounting methods
x               c(x)   q(w_i | w_{i-1})
the             48
the, dog        15     15/48
the, woman      11     11/48
the, man        10     10/48
the, park       5      5/48
the, job        2      2/48
the, telescope  1      1/48
the, manual     1      1/48
the, afternoon  1      1/48
the, country    1      1/48
the, street     1      1/48
Low-count bigrams have high estimates.

32 Discounting methods Define discounted counts $c^*(x) = c(x) - \beta$ for a fixed discount (here $\beta = 0.5$), and estimate $q(w_i \mid w_{i-1}) = c^*(w_{i-1}, w_i) / c(w_{i-1})$:
x               c(x)   c*(x)   q(w_i | w_{i-1})
the             48
the, dog        15     14.5    14.5/48
the, woman      11     10.5    10.5/48
the, man        10     9.5     9.5/48
the, park       5      4.5     4.5/48
the, job        2      1.5     1.5/48
the, telescope  1      0.5     0.5/48
the, manual     1      0.5     0.5/48
the, afternoon  1      0.5     0.5/48
the, country    1      0.5     0.5/48
the, street     1      0.5     0.5/48

33 Katz back-off In a bigram model, let
$A(w_{i-1}) = \{w : c(w_{i-1}, w) > 0\}, \qquad B(w_{i-1}) = \{w : c(w_{i-1}, w) = 0\}$
$q_{BO}(w_i \mid w_{i-1}) = \begin{cases} \dfrac{c^*(w_{i-1}, w_i)}{c(w_{i-1})} & w_i \in A(w_{i-1}) \\ \alpha(w_{i-1}) \dfrac{q(w_i)}{\sum_{w \in B(w_{i-1})} q(w)} & w_i \in B(w_{i-1}) \end{cases}$
$\alpha(w_{i-1}) = 1 - \sum_{w \in A(w_{i-1})} \dfrac{c^*(w_{i-1}, w)}{c(w_{i-1})}$
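A sketch of the bigram back-off estimate above with a fixed discount; the count containers, their names, and the discount value 0.5 are illustrative assumptions:

```python
# Sketch: Katz back-off for a bigram model with discounted counts c*(x) = c(x) - beta.
from collections import Counter

def q_bo(w, prev, bigram, unigram, beta=0.5):
    c_prev = sum(c for (u, _), c in bigram.items() if u == prev)
    seen = {v for (u, v) in bigram if u == prev}               # A(w_{i-1})
    total_uni = sum(unigram.values())
    q_uni = lambda x: unigram[x] / total_uni
    if w in seen:                                              # discounted ML estimate
        return (bigram[(prev, w)] - beta) / c_prev
    alpha = 1.0 - sum((bigram[(prev, v)] - beta) / c_prev for v in seen)
    missing = sum(q_uni(v) for v in unigram if v not in seen)  # mass of B(w_{i-1})
    return alpha * q_uni(w) / missing

bigram = Counter({("the", "dog"): 3, ("the", "cat"): 1})
unigram = Counter({"the": 4, "dog": 3, "cat": 1, "fish": 2})
print(q_bo("dog", "the", bigram, unigram))   # in A: (3 - 0.5) / 4
print(q_bo("fish", "the", bigram, unigram))  # in B: alpha * q(fish) / sum_B q
```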

34 Katz back-off In a trigram model, let
$A(w_{i-2}, w_{i-1}) = \{w : c(w_{i-2}, w_{i-1}, w) > 0\}, \qquad B(w_{i-2}, w_{i-1}) = \{w : c(w_{i-2}, w_{i-1}, w) = 0\}$
$q_{BO}(w_i \mid w_{i-2}, w_{i-1}) = \begin{cases} \dfrac{c^*(w_{i-2}, w_{i-1}, w_i)}{c(w_{i-2}, w_{i-1})} & w_i \in A(w_{i-2}, w_{i-1}) \\ \alpha(w_{i-2}, w_{i-1}) \dfrac{q_{BO}(w_i \mid w_{i-1})}{\sum_{w \in B(w_{i-2}, w_{i-1})} q_{BO}(w \mid w_{i-1})} & w_i \in B(w_{i-2}, w_{i-1}) \end{cases}$
$\alpha(w_{i-2}, w_{i-1}) = 1 - \sum_{w \in A(w_{i-2}, w_{i-1})} \dfrac{c^*(w_{i-2}, w_{i-1}, w)}{c(w_{i-2}, w_{i-1})}$

35 Class 3

36 Debt 1: Gradient $\dfrac{\partial J}{\partial \theta} = \lim_{\epsilon \to 0} \dfrac{J(\theta + \epsilon) - J(\theta - \epsilon)}{2\epsilon}$. Compute this for every parameter, for a small epsilon.
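A sketch of this numerical gradient check using a central difference; the quadratic objective J is an illustrative example:

```python
# Sketch: numerical gradient by central differences, one parameter at a time.
import numpy as np

def numerical_gradient(J, theta, eps=1e-5):
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        e = np.zeros_like(theta)
        e[i] = eps
        grad[i] = (J(theta + e) - J(theta - e)) / (2 * eps)  # central difference
    return grad

J = lambda th: np.sum(th ** 2)          # analytic gradient is 2 * theta
theta = np.array([1.0, -2.0, 3.0])
print(numerical_gradient(J, theta))     # approximately [2, -4, 6]
```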

37 Debt 2: matrix factorization We are in a sample space of (c, o) pairs with counts #(c, o):
$\#(c) = \sum_o \#(c, o), \qquad \#(o) = \sum_c \#(c, o), \qquad T = \sum_{c,o} \#(c, o)$
Let m be the context size, D the corpus length, and c(w) the number of times word w appears in the corpus. Then
$p(w) = \dfrac{c(w)}{D} = \dfrac{m \cdot c(w)}{m \cdot D} = \dfrac{\#(o)}{T}$ (taking o = w).

38 Debt 3: linear interpolation $q_{LI}(w_i \mid w_{i-2}, w_{i-1}) = \lambda_1\, q(w_i \mid w_{i-2}, w_{i-1}) + \lambda_2\, q(w_i \mid w_{i-1}) + \lambda_3\, q(w_i)$, with $\lambda_i \ge 0$ and $\sum_i \lambda_i = 1$

39 Debt 3: linear interpolation Make the coefficients depend on a bucketing of the bigram count:
$\Pi(w_{i-2}, w_{i-1}) = \begin{cases} 1 & c(w_{i-2}, w_{i-1}) = 0 \\ 2 & 1 \le c(w_{i-2}, w_{i-1}) \le \ldots \\ 3 & \ldots \le c(w_{i-2}, w_{i-1}) \le 10 \\ 4 & \text{otherwise} \end{cases}$
$q_{LI}(w_i \mid w_{i-2}, w_{i-1}) = \lambda_1^{\Pi(w_{i-2}, w_{i-1})}\, q(w_i \mid w_{i-2}, w_{i-1}) + \lambda_2^{\Pi(w_{i-2}, w_{i-1})}\, q(w_i \mid w_{i-1}) + \lambda_3^{\Pi(w_{i-2}, w_{i-1})}\, q(w_i)$
$\lambda_i^{\Pi(w_{i-2}, w_{i-1})} \ge 0, \qquad \lambda_1^{\Pi(w_{i-2}, w_{i-1})} + \lambda_2^{\Pi(w_{i-2}, w_{i-1})} + \lambda_3^{\Pi(w_{i-2}, w_{i-1})} = 1$

40 Debt 4: unknown words What if we see a completely new word at test time? Then $p(s) = 0$, which means infinite perplexity. Solution: create a special token <unk>. Fix a vocabulary V and replace any word in the training set not in V with <unk>. Train. At test time, use p(<unk>) for words not in the training set.
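A sketch of this <unk> recipe; the frequency cutoff used to fix V is an illustrative choice, not from the slides:

```python
# Sketch: map out-of-vocabulary words to <unk> before training and at test time.
from collections import Counter

def build_vocab(train_sentences, min_count=2):
    counts = Counter(w for sent in train_sentences for w in sent)
    return {w for w, c in counts.items() if c >= min_count}

def replace_unk(sentence, vocab):
    return [w if w in vocab else "<unk>" for w in sentence]

train = [["the", "dog", "barks"], ["the", "cat", "barks"]]
vocab = build_vocab(train)                               # {"the", "barks"}
print(replace_unk(["the", "aardvark", "barks"], vocab))  # ['the', '<unk>', 'barks']
```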

41 Administration Projects: word vectors, language models. Late HW: you can have 5 late days during the semester; afterwards it is a 5-point deduction per day.

42 Summary and recap Sentence probability: decompose with the chain rule. Use a Markov independence assumption. Smooth estimates to do better on rare events. There have been many attempts to improve LMs using syntax, but it is not easy. More complex smoothing methods exist for handling rare events (Kneser-Ney, Good-Turing, ...).

43 Problem Our estimator $q(w \mid u, v)$ is based on a one-hot representation: there is no relation between words, so $p(\text{played} \mid \text{the, actress})$ tells us nothing about $p(\text{played} \mid \text{the, actor})$. Can we use distributed representations? Neural networks.

44 Plan for today Feed-forward neural networks for language modeling; recurrent neural networks for language modeling; vanishing/exploding gradients; computation graphs.

45 Neural networks: history Proposed in the mid-20th century, advanced by backpropagation in the 1980s. In the 1990s SVMs appeared and were more successful. Since the beginning of this decade they have shown great success in speech, vision, and language.

46 Motivation 1 Can we learn representations from raw data?

47 Motivation 2 Can we learn non-linear decision boundaries? Actually very similar to motivation 1.

48 A single neuron A neuron is a computational unit of the form $f_{w,b}(x) = f(w^\top x + b)$, where x is the input vector, w the weight vector, b the bias, and f the activation function. (intro to ML)

49 A single neuron If f is the sigmoid, a neuron is logistic regression. Let (x, y) be a binary classification training example: $p(y = 1 \mid x) = \dfrac{1}{1 + e^{-w^\top x - b}} = \sigma(w^\top x + b)$. Provides a linear decision boundary. (intro to ML)

50 Single layer network Perform logistic regression in parallel multiple times (with different parameters): $y = \sigma(\hat{w}^\top \hat{x} + b) = \sigma(w^\top x)$, where $x, w \in \mathbb{R}^{d+1}$, $w_{d+1} = 1$, $x_{d+1} = b$.

51 Single layer network L1: input layer; L2: hidden layer; L3: output layer. The output layer provides the prediction; the hidden layer is the learned representation.

52 Multi-layer network Repeat: (figure not transcribed)

53 Matrix notation
$a_1 = f(W_{11} x_1 + W_{12} x_2 + W_{13} x_3 + b_1)$
$a_2 = f(W_{21} x_1 + W_{22} x_2 + W_{23} x_3 + b_2)$
$a_3 = f(W_{31} x_1 + W_{32} x_2 + W_{33} x_3 + b_3)$
$z = Wx + b, \qquad a = f(z)$
$x \in \mathbb{R}^4, \qquad z \in \mathbb{R}^3, \qquad W \in \mathbb{R}^{3 \times 4}$
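A sketch of the same computation in numpy, using a sigmoid as the activation f (an illustrative choice):

```python
# Sketch: one layer in matrix form, z = W x + b, a = f(z).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def layer_forward(x, W, b):
    z = W @ x + b        # pre-activations
    return sigmoid(z)    # activations a = f(z)

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 1))          # input vector
W = rng.normal(size=(3, 4))          # weight matrix
b = rng.normal(size=(3, 1))          # bias
print(layer_forward(x, W, b).shape)  # (3, 1)
```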

54 Language modeling with NNs Keep the Markov assumption. Learn a probability $q(w \mid u, v)$ with distributed representations.

55 Bengio et al., 2003: Language modeling with NNs
$e(w) = W_e w, \quad W_e \in \mathbb{R}^{d \times |V|}, \quad w \in \mathbb{R}^{|V| \times 1}$ (one-hot)
$h(w_{i-2}, w_{i-1}) = \sigma(W_h [e(w_{i-2}); e(w_{i-1})]), \quad W_h \in \mathbb{R}^{m \times 2d}$
$f(z) = \mathrm{softmax}(W_o z), \quad W_o \in \mathbb{R}^{|V| \times m}, \quad z \in \mathbb{R}^{m \times 1}$
$p(w_i \mid w_{i-2}, w_{i-1}) = f(h(w_{i-2}, w_{i-1}))_i$
(Diagram: the dog; e(the), e(dog) feed h(the, dog), then f(h(the, dog)) gives p(laughed | the, dog).)
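A sketch of this forward pass in numpy; the dimensions, the random initialization, and the use of tanh in place of the slide's σ are illustrative assumptions:

```python
# Sketch of the feed-forward trigram LM forward pass described above.
import numpy as np

V, d, m = 10, 4, 8                        # vocab size, embedding size, hidden size
rng = np.random.default_rng(0)
W_e = rng.normal(scale=0.1, size=(d, V))  # embedding matrix
W_h = rng.normal(scale=0.1, size=(m, 2 * d))
W_o = rng.normal(scale=0.1, size=(V, m))

def one_hot(i):
    v = np.zeros((V, 1)); v[i] = 1.0
    return v

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def predict(i_prev2, i_prev1):
    e = np.vstack([W_e @ one_hot(i_prev2), W_e @ one_hot(i_prev1)])  # [e(w_{i-2}); e(w_{i-1})]
    h = np.tanh(W_h @ e)                  # hidden layer
    return softmax(W_o @ h)               # distribution over the next word

p = predict(3, 7)
print(p.shape, float(p.sum()))            # (10, 1) 1.0
```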

56 Loss function Minimize the negative log-likelihood. When we learned word vectors, we were basically training a language model.

57 Advantages If we see "The cat is walking in the bedroom" in the training set, we can hope to generalize to "A dog was running through a room". We can use n-grams with n > 3 and pay only a linear price. Compared to what?

58 Parameter estimation We train with SGD. How to efficiently compute the gradients? Backpropagation (Rumelhart, Hinton and Williams, 1986). The proof is in intro to ML; here we will only repeat the algorithm and give an example.

59 Backpropagation Notation:
$W_t$: weight matrix at the input of layer t
$z_t$: output vector at layer t
$x = z_0$: input vector
$y$: gold scalar
$\hat{y} = z_L$: predicted scalar
$l(y, \hat{y})$: loss function
$v_t = W_t z_{t-1}$: pre-activations
$\delta_t$: gradient vector

60 Backpropagation Run the network forward to obtain all values $v_t, z_t$.
Base: $\delta_L = l'(y, z_L)$
Recursion: $\delta_t = W_{t+1}^\top (\sigma'(v_{t+1}) \odot \delta_{t+1})$, where $\delta_{t+1} \in \mathbb{R}^{d_{t+1} \times 1}$ and $W_{t+1} \in \mathbb{R}^{d_{t+1} \times d_t}$
Gradients: $\dfrac{\partial l}{\partial W_t} = (\delta_t \odot \sigma'(v_t))\, z_{t-1}^\top$

61 Bigram LM example Forward pass:
$z_0 \in \mathbb{R}^{|V| \times 1}$: one-hot input vector
$z_1 = W_1 z_0, \quad W_1 \in \mathbb{R}^{d_1 \times |V|}, \quad z_1 \in \mathbb{R}^{d_1 \times 1}$
$z_2 = \sigma(W_2 z_1), \quad W_2 \in \mathbb{R}^{d_2 \times d_1}, \quad z_2 \in \mathbb{R}^{d_2 \times 1}$
$z_3 = \mathrm{softmax}(W_3 z_2), \quad W_3 \in \mathbb{R}^{|V| \times d_2}, \quad z_3 \in \mathbb{R}^{|V| \times 1}$
$l(y, z_3) = -\sum_i y^{(i)} \log z_3^{(i)}$

62 Bigram LM example Backward pass:
$\sigma'(v_3) \odot \delta_3 = z_3 - y$
$\delta_2 = W_3^\top (\sigma'(v_3) \odot \delta_3) = W_3^\top (z_3 - y)$
$\delta_1 = W_2^\top (\sigma'(v_2) \odot \delta_2) = W_2^\top (z_2 \odot (1 - z_2) \odot \delta_2)$
$\dfrac{\partial l}{\partial W_3} = (\delta_3 \odot \sigma'(v_3))\, z_2^\top = (z_3 - y)\, z_2^\top$
$\dfrac{\partial l}{\partial W_2} = (\delta_2 \odot \sigma'(v_2))\, z_1^\top = (\delta_2 \odot z_2 \odot (1 - z_2))\, z_1^\top$
$\dfrac{\partial l}{\partial W_1} = (\delta_1 \odot \sigma'(v_1))\, z_0^\top = \delta_1 z_0^\top$ (the first layer is linear)
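A numpy sketch of this forward/backward computation; the dimensions and the random data are illustrative:

```python
# Sketch of the tiny bigram LM forward and backward passes above.
import numpy as np

V, d1, d2 = 6, 5, 4
rng = np.random.default_rng(1)
W1 = rng.normal(scale=0.1, size=(d1, V))
W2 = rng.normal(scale=0.1, size=(d2, d1))
W3 = rng.normal(scale=0.1, size=(V, d2))

sigmoid = lambda v: 1.0 / (1.0 + np.exp(-v))
softmax = lambda v: np.exp(v - v.max()) / np.exp(v - v.max()).sum()

z0 = np.zeros((V, 1)); z0[2] = 1.0        # one-hot input word
y = np.zeros((V, 1)); y[4] = 1.0          # one-hot next word

# Forward pass
z1 = W1 @ z0
z2 = sigmoid(W2 @ z1)
z3 = softmax(W3 @ z2)
loss = -(y * np.log(z3)).sum()

# Backward pass (matches the expressions on the slide)
dW3 = (z3 - y) @ z2.T
delta2 = W3.T @ (z3 - y)
dW2 = (delta2 * z2 * (1 - z2)) @ z1.T
delta1 = W2.T @ (delta2 * z2 * (1 - z2))
dW1 = delta1 @ z0.T                       # first layer is linear
print(loss, dW1.shape, dW2.shape, dW3.shape)
```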

63 Summary Neural nets can improve language models: better scalability for larger n, use of word similarity, complex decision boundaries. Training is done through backpropagation. But we still have a Markov assumption: "He is from France, so it makes sense that his first language is ..."

64 Recurrent neural networks
Input: $w_1, \ldots, w_{t-1}, w_t, w_{t+1}, \ldots, w_T$, $w_i \in \mathbb{R}^{|V|}$ (one-hot)
Model:
$x_t = W^{(e)} w_t, \quad W^{(e)} \in \mathbb{R}^{d \times |V|}$
$h_t = \sigma(W^{(hh)} h_{t-1} + W^{(hx)} x_t), \quad W^{(hh)} \in \mathbb{R}^{D_h \times D_h}, \quad W^{(hx)} \in \mathbb{R}^{D_h \times d}$
$\hat{y}_t = \mathrm{softmax}(W^{(s)} h_t), \quad W^{(s)} \in \mathbb{R}^{|V| \times D_h}$
(Diagram: predicting "dog" after "the", and "laughed" after "the dog".)
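A numpy sketch of this RNN forward pass over a short sequence of word indices; the dimensions and random initialization are illustrative:

```python
# Sketch: RNN forward pass, sharing the same weights at every time step.
import numpy as np

V, d, Dh = 8, 5, 6
rng = np.random.default_rng(0)
W_e = rng.normal(scale=0.1, size=(d, V))
W_hh = rng.normal(scale=0.1, size=(Dh, Dh))
W_hx = rng.normal(scale=0.1, size=(Dh, d))
W_s = rng.normal(scale=0.1, size=(V, Dh))

sigmoid = lambda v: 1.0 / (1.0 + np.exp(-v))
softmax = lambda v: np.exp(v - v.max()) / np.exp(v - v.max()).sum()

def rnn_forward(word_indices):
    h = np.zeros((Dh, 1))                    # h_0
    outputs = []
    for w in word_indices:
        x = W_e[:, [w]]                      # embedding of the one-hot input
        h = sigmoid(W_hh @ h + W_hx @ x)     # recurrent update
        outputs.append(softmax(W_s @ h))     # y_hat_t: distribution over next word
    return outputs

y_hats = rnn_forward([1, 4, 2])
print(len(y_hats), y_hats[0].shape)          # 3 (8, 1)
```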

65 Recurrent neural networks

66 Recurrent neural networks Can exploit long-range dependencies. Each layer has the same weights (weight sharing/tying). What is the loss function? $J(\theta) = \sum_{t=1}^{T} \mathrm{CE}(y_t, \hat{y}_t)$

67 Recurrent neural networks Components: model? An RNN (seen before). Loss? Sum of cross-entropy over time. Optimization? SGD. Gradient computation? Backpropagation through time.

68 Training RNNs Capturing long-range dependencies with RNNs is difficult: vanishing/exploding gradients. Small changes in the hidden layer values at step k cause huge/minuscule changes to the hidden layer values at step t, for t >> k.

69 Explanation Consider a simple linear RNN with no input: $h_t = W h_{t-1}$, so $h_t = W^t h_0 = (Q \Lambda Q^{-1})^t h_0 = Q \Lambda^t Q^{-1} h_0$, where $W = Q \Lambda Q^{-1}$ is an eigendecomposition. Some eigenvalues will explode and some will shrink to zero; the input is stretched in the direction of the eigenvector with the largest eigenvalue.

70 Pascanu et al., 2013: Explanation
$h_t = W^{(hh)} \sigma(h_{t-1}) + W^{(hx)} x_t$
$\dfrac{\partial E_t}{\partial \theta} = \sum_{k=1}^{t} \dfrac{\partial E_t}{\partial h_t} \dfrac{\partial h_t}{\partial h_k} \dfrac{\partial h_k}{\partial \theta}, \qquad \dfrac{\partial h_t}{\partial h_k} = \prod_{i=k+1}^{t} W^{(hh)\top} \mathrm{diag}(\sigma'(h_{i-1}))$
If $\left\| W^{(hh)\top} \mathrm{diag}(\sigma'(h_{i-1})) \right\| \le 1$, the contribution of step k to the total gradient at step t is small.

71 Solutions Exploding gradients: gradient clipping. Re-normalize the gradient to have norm less than C (this is no longer the true gradient); exploding gradients are easy to detect. Vanishing gradients: the problem is with the model! Change it (LSTMs, GRUs). Maybe later in this class.
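A sketch of gradient clipping; the threshold C = 5 is an illustrative choice:

```python
# Sketch: re-scale the gradient when its norm exceeds a threshold C.
import numpy as np

def clip_gradient(grad, C=5.0):
    norm = np.linalg.norm(grad)
    if norm > C:
        grad = grad * (C / norm)   # no longer the true gradient, but bounded
    return grad

g = np.array([30.0, 40.0])         # norm 50
print(clip_gradient(g))            # rescaled to norm 5: [3., 4.]
```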

72 Illustration Pascanu et al, 2013

73 Results Jozefowicz et al, 2016

74 Summary Language modeling is a fundamental NLP task used in machine translation, spelling correction, speech recognition, etc. Traditional models use n-gram counts and smoothing. Feed-forward models take word similarity into account to generalize better. Recurrent models can potentially learn to exploit long-range interactions. Neural models dramatically reduced perplexity. Recurrent networks are now used in many other NLP tasks (bidirectional RNNs, deep RNNs).
