Natural Language Processing
1 Natural Language Processing Language models Based on slides from Michael Collins, Chris Manning and Richard Socher
2 Plan
Problem definition
Trigram models
Evaluation
Estimation
Interpolation
Discounting
3 Motivations
Define a probability distribution over sentences. Why?
p(w | 01010...) ∝ p(01010... | w) · p(w) (noisy-channel decomposition: the language model p(w) scores candidate sentences)
Machine translation: P(high winds) > P(large winds)
Spelling correction: P(The office is fifteen minutes from here) > P(The office is fifteen minuets from here)
Speech recognition (that's where it started!): P(recognize speech) > P(wreck a nice beach)
And more!
4 Motivations Techniques will be useful later 4
5 Problem definition
Given a finite vocabulary V = {the, a, man, telescope, Beckham, two, ...}
We have an infinite language L: V* concatenated with the special symbol STOP
the STOP
a STOP
the fan STOP
the fan saw Beckham STOP
the fan saw saw STOP
the fan saw Beckham play for Real Madrid STOP
6 Problem definition
Input: a training set of example sentences (currently: hundreds of billions of words)
Output: a probability distribution p over L, with ∑_{x ∈ L} p(x) = 1 and p(x) ≥ 0 for all x ∈ L
p(the STOP) = ...
p(the fan saw Beckham STOP) = 2 × 10^{-8}
p(the fan saw saw STOP) = ...
7 A naive method
Assume we have N training sentences.
Let x_1, x_2, ..., x_n be a sentence, and c(x_1, x_2, ..., x_n) the number of times it appears in the training data.
Define a language model: p(x_1, ..., x_n) = c(x_1, ..., x_n) / N
No generalization!
8 Markov processes
Given a sequence of n random variables X_1, X_2, ..., X_n, with n = 100 and X_i ∈ V, we want a sequence probability model p(X_1 = x_1, X_2 = x_2, ..., X_n = x_n).
There are |V|^n possible sequences.
9 First-order Markov process
Chain rule:
p(X_1 = x_1, X_2 = x_2, ..., X_n = x_n) = p(X_1 = x_1) ∏_{i=2}^{n} p(X_i = x_i | X_1 = x_1, ..., X_{i-1} = x_{i-1})
Markov assumption:
p(X_i = x_i | X_1 = x_1, ..., X_{i-1} = x_{i-1}) = p(X_i = x_i | X_{i-1} = x_{i-1})
10 Second-order Markov process
Relax the independence assumption:
p(X_1 = x_1, X_2 = x_2, ..., X_n = x_n) = p(X_1 = x_1) · p(X_2 = x_2 | X_1 = x_1) · ∏_{i=3}^{n} p(X_i = x_i | X_{i-2} = x_{i-2}, X_{i-1} = x_{i-1})
Simplify notation: x_0 = *, x_{-1} = *
Is this reasonable?
11 Detail: variable length
We want a probability distribution over sequences of any length.
Define X_n = STOP always, and obtain a probability distribution over all sequences.
Intuition: at every step you have probability α_h of stopping (conditioned on the history) and 1 − α_h of continuing.
p(X_1 = x_1, X_2 = x_2, ..., X_n = x_n) = ∏_{i=1}^{n} p(X_i = x_i | X_{i-2} = x_{i-2}, X_{i-1} = x_{i-1})
12 Trigram language model
A trigram language model contains:
A vocabulary V
A non-negative parameter q(w | u, v) for every trigram, where w ∈ V ∪ {STOP} and u, v ∈ V ∪ {*}
The probability of a sentence x_1, ..., x_n, where x_n = STOP, is
p(x_1, ..., x_n) = ∏_{i=1}^{n} q(x_i | x_{i-2}, x_{i-1})
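As an illustration (not from the slides), here is a minimal Python sketch of how a trigram model scores a sentence, assuming its parameters are supplied as a dictionary q mapping (u, v, w) to q(w | u, v):

```python
# Minimal sketch: score a sentence under a trigram model.
# Assumes `q` is a dict mapping (u, v, w) -> q(w | u, v);
# "*" marks the two start symbols and "STOP" ends every sentence.
import math

def sentence_log_prob(words, q):
    """Return log p(x_1 ... x_n STOP) = sum_i log q(x_i | x_{i-2}, x_{i-1})."""
    tokens = words + ["STOP"]
    u, v = "*", "*"
    logp = 0.0
    for w in tokens:
        logp += math.log(q[(u, v, w)])   # KeyError / log(0) if the trigram is unseen
        u, v = v, w
    return logp

# Example with toy, purely illustrative parameters:
q = {("*", "*", "the"): 0.5, ("*", "the", "dog"): 0.4,
     ("the", "dog", "barks"): 0.3, ("dog", "barks", "STOP"): 0.6}
print(math.exp(sentence_log_prob(["the", "dog", "barks"], q)))  # ~0.036
```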
13 Example
p(the dog barks STOP) = q(the | *, *) · q(dog | *, the) · q(barks | the, dog) · q(STOP | dog, barks)
14 Limitation
The Markov assumption is false:
He is from France, so it makes sense that his first language is ...
We would want to model longer dependencies.
15 Sparseness
Maximum likelihood for estimating q: let c(w_1, ..., w_n) be the number of times that n-gram appears in a corpus.
q(w_i | w_{i-2}, w_{i-1}) = c(w_{i-2}, w_{i-1}, w_i) / c(w_{i-2}, w_{i-1})
If the vocabulary has 20,000 words, the number of parameters is 8 × 10^{12}!
Most sentences will have zero or undefined probabilities.
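A small sketch of this maximum-likelihood estimation; the corpus format (a list of tokenized sentences) and the padding with * and STOP are assumptions for illustration:

```python
# Sketch: maximum-likelihood trigram estimates from a tokenized corpus.
# Each sentence is padded with two "*" start symbols and a final "STOP".
from collections import Counter

def mle_trigram(sentences):
    tri, bi = Counter(), Counter()
    for sent in sentences:
        tokens = ["*", "*"] + sent + ["STOP"]
        for i in range(2, len(tokens)):
            u, v, w = tokens[i - 2], tokens[i - 1], tokens[i]
            tri[(u, v, w)] += 1
            bi[(u, v)] += 1
    # q(w | u, v) = c(u, v, w) / c(u, v); unseen trigrams get probability 0,
    # and a history with c(u, v) = 0 leaves q undefined -- the sparseness problem.
    return {uvw: tri[uvw] / bi[uvw[:2]] for uvw in tri}

q = mle_trigram([["the", "fan", "saw", "Beckham"], ["the", "fan", "saw", "saw"]])
print(q[("the", "fan", "saw")])  # 1.0
```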
16 Berkeley restaurant project sentences
can you tell me about any good cantonese restaurants close by
mid priced thai food is what i'm looking for
tell me about chez panisse
can you give me a listing of the kinds of food that are available
i'm looking for a good place to eat breakfast
when is caffe venezia open during the day
17 Bigram counts 17
18 Bigram probabilities 18
19 What did we learn
p(english | want) < p(chinese | want): people like Chinese stuff more, at least in this corpus
p(to | want): English behaves in a certain way ("want to")
p(eat | to): English behaves in a certain way
20 Evaluation: perplexity
Test data: S = {s_1, s_2, ..., s_m}; the parameters are not estimated from S.
A good language model has high p(S) and low perplexity.
p(S) = ∏_{i=1}^{m} p(s_i), so log_2 p(S) = ∑_{i=1}^{m} log_2 p(s_i)
Perplexity is the normalized inverse probability of S:
perplexity = 2^{-l}, where l = (1/M) ∑_{i=1}^{m} log_2 p(s_i) and M is the number of words in the corpus
21 Evaluation: perplexity
Say we have a vocabulary V, N = |V| + 1, and a trigram model with the uniform distribution q(w | u, v) = 1/N. Then
l = (1/M) ∑_{i=1}^{m} log_2 p(s_i) = log_2 (1/N) = −log_2 N
perplexity = 2^{-l} = 2^{log_2 N} = N
Perplexity is the effective vocabulary size.
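A hedged sketch of the perplexity computation, plus a sanity check that a uniform model over N outcomes gives perplexity N; sentence_log_prob2 is an assumed helper returning the log-base-2 probability of one sentence (including STOP):

```python
# Sketch: perplexity of a test set, given a per-sentence log2-probability function.
import math

def perplexity(test_sentences, sentence_log_prob2):
    M = sum(len(s) + 1 for s in test_sentences)          # total words, counting STOP
    l = sum(sentence_log_prob2(s) for s in test_sentences) / M
    return 2 ** (-l)

# Sanity check: a uniform model over N = |V| + 1 outcomes gives perplexity N.
N = 10
uniform = lambda s: (len(s) + 1) * math.log2(1.0 / N)
print(perplexity([["a", "b", "c"], ["d"]], uniform))      # 10.0 (= N)
```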
22 Typical values of perplexity
When |V| = 50,000:
trigram model perplexity: 74 (<< 50,000)
bigram model: 137
unigram model: much higher
23 Evaluation Extrinsic evaluations: MT, speech, spelling correction, 23
24 History Shannon (1950) estimated the perplexity score that humans get for printed English (we are good!) Test your perplexity 24
25 Estimating parameters
Recall that the number of parameters for a trigram model with |V| = 20,000 is 8 × 10^{12}, leading to zero or undefined probabilities for
q(w_i | w_{i-2}, w_{i-1}) = c(w_{i-2}, w_{i-1}, w_i) / c(w_{i-2}, w_{i-1})
26 Bias-variance tradeoff
Given a corpus of length M:
Trigram model: q(w_i | w_{i-2}, w_{i-1}) = c(w_{i-2}, w_{i-1}, w_i) / c(w_{i-2}, w_{i-1})
Bigram model: q(w_i | w_{i-1}) = c(w_{i-1}, w_i) / c(w_{i-1})
Unigram model: q(w_i) = c(w_i) / M
28 Linear interpolation
q_LI(w_i | w_{i-2}, w_{i-1}) = λ_1 q(w_i | w_{i-2}, w_{i-1}) + λ_2 q(w_i | w_{i-1}) + λ_3 q(w_i)
with λ_i ≥ 0 and λ_1 + λ_2 + λ_3 = 1.
Combine the three models to get all the benefits.
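A minimal sketch of the interpolated estimate, assuming q3, q2, q1 hold the trigram, bigram, and unigram MLE estimates as dictionaries (the names and default weights are illustrative):

```python
# Sketch: linearly interpolated trigram estimate.
def q_li(w, u, v, q3, q2, q1, lambdas=(0.5, 0.3, 0.2)):
    l1, l2, l3 = lambdas          # must be >= 0 and sum to 1
    return (l1 * q3.get((u, v, w), 0.0)
            + l2 * q2.get((v, w), 0.0)
            + l3 * q1.get(w, 0.0))
```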
29 Linear interpolation
Need to verify that the parameters define a probability distribution:
∑_{w ∈ V} q_LI(w | u, v) = ∑_{w ∈ V} [λ_1 q(w | u, v) + λ_2 q(w | v) + λ_3 q(w)]
= λ_1 ∑_{w} q(w | u, v) + λ_2 ∑_{w} q(w | v) + λ_3 ∑_{w} q(w)
= λ_1 + λ_2 + λ_3 = 1
30 Estimating coefficients
Use a validation/development set (intro to ML!).
Partition the training data into training (90%?) and dev (10%?) data, and optimize the coefficients to minimize perplexity (the measure we care about!) on the development data.
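One simple way to do this (a sketch, not the method prescribed in the slides) is a grid search over the λ's, scoring each setting by development-set perplexity; it reuses the q_li and perplexity sketches above and assumes every dev word has a unigram estimate, so probabilities stay positive whenever λ_3 > 0:

```python
# Sketch: pick interpolation weights by grid search on dev-set perplexity.
import itertools, math

def dev_log_prob2(sent, q3, q2, q1, lambdas):
    tokens = ["*", "*"] + sent + ["STOP"]
    return sum(math.log2(q_li(tokens[i], tokens[i - 2], tokens[i - 1],
                               q3, q2, q1, lambdas))
               for i in range(2, len(tokens)))

def tune_lambdas(dev_sentences, q3, q2, q1, step=0.1):
    best, best_ppl = None, float("inf")
    grid = [i * step for i in range(int(1 / step) + 1)]
    for l1, l2 in itertools.product(grid, grid):
        l3 = 1.0 - l1 - l2
        if l3 <= 0:               # require some unigram mass to avoid log(0)
            continue
        ppl = perplexity(dev_sentences,
                         lambda s: dev_log_prob2(s, q3, q2, q1, (l1, l2, l3)))
        if ppl < best_ppl:
            best, best_ppl = (l1, l2, l3), ppl
    return best, best_ppl
```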
31 Discounting methods

x                c(x)   q(w_i | w_{i-1})
the              48
the, dog         15     15/48
the, woman       11     11/48
the, man         10     10/48
the, park         5      5/48
the, job          2      2/48
the, telescope    1      1/48
the, manual       1      1/48
the, afternoon    1      1/48
the, country      1      1/48
the, street       1      1/48

Low-count bigrams have high estimates.
32 Discounting methods
Define discounted counts c*(x) = c(x) − β for a fixed discount β (commonly 0.5), and estimate q(w_i | w_{i-1}) = c*(w_{i-1}, w_i) / c(w_{i-1}).
The table on the slide repeats the bigrams above with their discounted counts c*(x) and the resulting estimates c*(x)/48.
33 Katz back-off
In a bigram model:
A(w_{i-1}) = {w : c(w_{i-1}, w) > 0}
B(w_{i-1}) = {w : c(w_{i-1}, w) = 0}
q_BO(w_i | w_{i-1}) =
  c*(w_{i-1}, w_i) / c(w_{i-1})                               if w_i ∈ A(w_{i-1})
  α(w_{i-1}) · q(w_i) / ∑_{w ∈ B(w_{i-1})} q(w)               if w_i ∈ B(w_{i-1})
α(w_{i-1}) = 1 − ∑_{w ∈ A(w_{i-1})} c*(w_{i-1}, w) / c(w_{i-1})
34 Katz back-off
In a trigram model:
A(w_{i-2}, w_{i-1}) = {w : c(w_{i-2}, w_{i-1}, w) > 0}
B(w_{i-2}, w_{i-1}) = {w : c(w_{i-2}, w_{i-1}, w) = 0}
q_BO(w_i | w_{i-2}, w_{i-1}) =
  c*(w_{i-2}, w_{i-1}, w_i) / c(w_{i-2}, w_{i-1})                                             if w_i ∈ A(w_{i-2}, w_{i-1})
  α(w_{i-2}, w_{i-1}) · q_BO(w_i | w_{i-1}) / ∑_{w ∈ B(w_{i-2}, w_{i-1})} q_BO(w | w_{i-1})    if w_i ∈ B(w_{i-2}, w_{i-1})
α(w_{i-2}, w_{i-1}) = 1 − ∑_{w ∈ A(w_{i-2}, w_{i-1})} c*(w_{i-2}, w_{i-1}, w) / c(w_{i-2}, w_{i-1})
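A sketch of the bigram case, assuming count dictionaries c (bigrams) and c1 (unigrams), unigram estimates q, and a fixed discount β; the names and the value of β are illustrative, not from the slides:

```python
# Sketch of Katz back-off for a bigram model.
#   c:  dict (v, w) -> raw bigram count,  c1: dict v -> unigram count,
#   q:  unigram MLE estimates,            beta: fixed discount.
# Assumes the history v was seen in training (c1[v] > 0) and w is in vocab.
def katz_bigram(w, v, c, c1, q, vocab, beta=0.5):
    seen = {x for x in vocab if c.get((v, x), 0) > 0}              # A(v)
    if w in seen:
        return (c[(v, w)] - beta) / c1[v]                          # discounted MLE
    alpha = 1.0 - sum((c[(v, x)] - beta) / c1[v] for x in seen)    # left-over mass
    unseen_mass = sum(q[x] for x in vocab if x not in seen)
    return alpha * q[w] / unseen_mass                              # back off to unigrams
```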
35 Class 3
36 Debt 1: Gradient
∂J/∂θ = lim_{ε→0} [J(θ + ε) − J(θ − ε)] / (2ε)
Compute this for every parameter, using a small ε.
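A sketch of this numerical gradient check in NumPy, perturbing one parameter at a time:

```python
# Sketch: central-difference numerical gradient, (J(theta+eps) - J(theta-eps)) / (2*eps).
import numpy as np

def numerical_gradient(J, theta, eps=1e-5):
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        theta.flat[i] += eps
        plus = J(theta)
        theta.flat[i] -= 2 * eps
        minus = J(theta)
        theta.flat[i] += eps            # restore the parameter
        grad.flat[i] = (plus - minus) / (2 * eps)
    return grad

# Example: J(theta) = ||theta||^2 has gradient 2 * theta.
theta = np.array([1.0, -2.0, 3.0])
print(numerical_gradient(lambda t: float(np.sum(t ** 2)), theta))  # ~[ 2. -4.  6.]
```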
37 Debt 2: matrix factorization
We are in a sample space of (center, outside) word pairs, with counts #(c, o):
#(c) = ∑_o #(c, o), #(o) = ∑_c #(c, o), T = ∑_{c,o} #(c, o)
Let m be the context size, D the corpus length, and c(w) the number of times word w appears in the corpus. Then
p(w) = c(w) / D = m · c(w) / (m · D) = #(o) / T
38 Debt 3: linear interpolation
q_LI(w_i | w_{i-2}, w_{i-1}) = λ_1 q(w_i | w_{i-2}, w_{i-1}) + λ_2 q(w_i | w_{i-1}) + λ_3 q(w_i)
with λ_i ≥ 0 and λ_1 + λ_2 + λ_3 = 1.
39 Debt 3: linear interpolation
Bucket histories by their count:
Π(w_{i-2}, w_{i-1}) =
  1 if c(w_{i-2}, w_{i-1}) = 0
  2 if 1 ≤ c(w_{i-2}, w_{i-1}) ≤ K
  3 if K < c(w_{i-2}, w_{i-1}) ≤ 10
  4 otherwise
(for some low-count cutoff K), and let the interpolation weights depend on the bucket:
q_LI(w_i | w_{i-2}, w_{i-1}) = λ_1^{Π(w_{i-2}, w_{i-1})} q(w_i | w_{i-2}, w_{i-1}) + λ_2^{Π(w_{i-2}, w_{i-1})} q(w_i | w_{i-1}) + λ_3^{Π(w_{i-2}, w_{i-1})} q(w_i)
with λ_i^{Π(w_{i-2}, w_{i-1})} ≥ 0 and λ_1^{Π(w_{i-2}, w_{i-1})} + λ_2^{Π(w_{i-2}, w_{i-1})} + λ_3^{Π(w_{i-2}, w_{i-1})} = 1.
40 Debt 4: unknown words
What if we see a completely new word at test time? p(s) = 0 → infinite perplexity.
Solution: create a special token <unk>.
Fix a vocabulary V and replace any word in the training set not in V with <unk>.
Train.
At test time, use p(<unk>) for words not in V.
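A small sketch of this preprocessing, with the vocabulary cutoff K chosen arbitrarily for illustration:

```python
# Sketch: restrict the vocabulary to the K most frequent training words and
# map everything else (in training and at test time) to "<unk>".
from collections import Counter

def build_vocab(train_sentences, K=10000):
    counts = Counter(w for s in train_sentences for w in s)
    return {w for w, _ in counts.most_common(K)}

def replace_unk(sentence, vocab):
    return [w if w in vocab else "<unk>" for w in sentence]
```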
41 Administration
Projects: word vectors, language models.
Late HW: you can have 5 late days during the semester; afterwards it is a 5-point deduction per day.
42 Summary and recap
Sentence probability: decompose with the chain rule.
Use a Markov independence assumption.
Smooth estimates to do better on rare events.
Many attempts to improve LMs using syntax, but not easy.
More complex smoothing methods exist for handling rare events (Kneser-Ney, Good-Turing, ...).
43 Problem
Our estimator q(w | u, v) is based on a one-hot representation: there is no relation between words, so p(played | the, actress) tells us nothing about p(played | the, actor).
Can we use distributed representations? Neural networks.
44 Plan for today Feed-forward neural networks for language modeling Recurrent neural networks for language modeling Vanishing/exploding gradients Computation graphs
45 Neural networks: history
Proposed in the mid-20th century, advanced by backpropagation in the 1980s.
In the 1990s SVMs appeared and were more successful.
Since the beginning of this decade, they have shown great success in speech, vision, and language.
46 Motivation 1 Can we learn representations from raw data? 46
47 Motivation 2 Can we learn non-linear decision boundaries Actually very similar to motivation 1 47
48 A single neuron
A neuron is a computational unit of the form f_{w,b}(x) = f(w^T x + b)
x: input vector
w: weight vector
b: bias
f: activation function
(intro to ML)
49 A single neuron
If f is the sigmoid, a neuron is logistic regression.
Let (x, y) be a binary classification training example:
p(y = 1 | x) = 1 / (1 + e^{-(w^T x + b)}) = σ(w^T x + b)
Provides a linear decision boundary.
(intro to ML)
50 Single layer network
Perform logistic regression in parallel multiple times (with different parameters):
y = σ(ŵ^T x̂ + b) = σ(w^T x), where w, x ∈ R^{d+1}, x_{d+1} = 1, w_{d+1} = b (the bias is absorbed into the weights)
51 Single layer network L1: input layer L2: hidden layer L3: output layer Output layer provides the prediction Hidden layer is the learned representation 51
52 Multi-layer network Repeat: 52
53 Matrix notation
a_1 = f(W_{11} x_1 + W_{12} x_2 + W_{13} x_3 + b_1)
a_2 = f(W_{21} x_1 + W_{22} x_2 + W_{23} x_3 + b_2)
a_3 = f(W_{31} x_1 + W_{32} x_2 + W_{33} x_3 + b_3)
z = Wx + b, a = f(z), with x ∈ R^3, z ∈ R^3, W ∈ R^{3×3}
54 Language modeling with NNs
Keep the Markov assumption.
Learn a probability q(u | v, w) with distributed representations.
55 Bengio et al., 2003: Language modeling with NNs
e(w) = W_e w, W_e ∈ R^{d×|V|}, w ∈ R^{|V|×1} a one-hot vector
h(w_{i-2}, w_{i-1}) = σ(W_h [e(w_{i-2}); e(w_{i-1})]), W_h ∈ R^{m×2d}
f(z) = softmax(W_o z), W_o ∈ R^{|V|×m}, z ∈ R^{m×1}
p(w_i | w_{i-2}, w_{i-1}) = f(h(w_{i-2}, w_{i-1}))_{w_i}
(Figure: the inputs "the" and "dog" are embedded as e(the) and e(dog), combined into h(the, dog), and f(h(the, dog)) gives p(laughed | the, dog).)
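A NumPy sketch of this forward pass with randomly initialized weights; the dimensions follow the slide, while the concrete values chosen for |V|, d, m are arbitrary:

```python
# Sketch of the feed-forward trigram LM forward pass (Bengio-style).
# W_e in R^{d x |V|}, W_h in R^{m x 2d}, W_o in R^{|V| x m}.
import numpy as np

V, d, m = 1000, 50, 100
rng = np.random.default_rng(0)
W_e = rng.normal(scale=0.1, size=(d, V))
W_h = rng.normal(scale=0.1, size=(m, 2 * d))
W_o = rng.normal(scale=0.1, size=(V, m))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def p_next(word_id_prev2, word_id_prev1):
    """Return the distribution p(. | w_{i-2}, w_{i-1}) as a length-|V| vector."""
    e2, e1 = W_e[:, word_id_prev2], W_e[:, word_id_prev1]   # embedding lookup
    h = sigmoid(W_h @ np.concatenate([e2, e1]))             # hidden layer
    return softmax(W_o @ h)                                 # output distribution

print(p_next(3, 17).sum())   # ~1.0: a valid distribution over the vocabulary
```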
56 Loss function Minimize negative log-likelihood When we learned word vectors, we were basically training a language model 56
57 Advantages
If we see "The cat is walking in the bedroom" in the training set, we can hope to generalize to "A dog was running through a room".
We can use n-grams with n > 3 and pay only a linear price. Compared to what?
58 Parameter estimation We train with SGD How to efficiently compute the gradients? backpropagation (Rumelhart, Hinton and Williams, 1986) Proof in intro to ML, will repeat algorithm only and give an example here 58
59 Backpropagation
Notation:
W_t: weight matrix at the input of layer t
z_t: output vector of layer t
x = z_0: input vector
y: gold scalar
ŷ = z_L: predicted scalar
l(y, ŷ): loss function
v_t = W_t z_{t-1}: pre-activations
δ_t: gradient vector
60 Backpropagation
Run the network forward to obtain all values v_t, z_t.
Base: δ_L = l'(y, z_L)
Recursion: δ_t = W_{t+1}^T (σ'(v_{t+1}) ⊙ δ_{t+1}), with δ_{t+1} ∈ R^{d_{t+1}×1}, W_{t+1} ∈ R^{d_{t+1}×d_t}
Weight gradients: ∂l/∂W_t = (δ_t ⊙ σ'(v_t)) z_{t-1}^T
61 Bigram LM example
Forward pass:
z_0 ∈ R^{|V|×1}: one-hot input vector
z_1 = W_1 z_0, W_1 ∈ R^{d_1×|V|}, z_1 ∈ R^{d_1×1}
z_2 = σ(W_2 z_1), W_2 ∈ R^{d_2×d_1}, z_2 ∈ R^{d_2×1}
z_3 = softmax(W_3 z_2), W_3 ∈ R^{|V|×d_2}, z_3 ∈ R^{|V|×1}
l(y, z_3) = −∑_i y^{(i)} log z_3^{(i)}
62 Bigram LM example
Backward pass:
σ'(v_3) ⊙ δ_3 = z_3 − y (softmax output with cross-entropy loss)
δ_2 = W_3^T (σ'(v_3) ⊙ δ_3) = W_3^T (z_3 − y)
δ_1 = W_2^T (σ'(v_2) ⊙ δ_2) = W_2^T (δ_2 ⊙ z_2 ⊙ (1 − z_2))
∂l/∂W_3 = (δ_3 ⊙ σ'(v_3)) z_2^T = (z_3 − y) z_2^T
∂l/∂W_2 = (δ_2 ⊙ σ'(v_2)) z_1^T = (δ_2 ⊙ z_2 ⊙ (1 − z_2)) z_1^T
∂l/∂W_1 = (δ_1 ⊙ σ'(v_1)) z_0^T = δ_1 z_0^T (layer 1 is linear)
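A NumPy sketch of this forward and backward pass for one (previous word, next word) training pair, with small random weights; it follows the formulas above rather than any particular library's implementation:

```python
# Sketch: forward and backward pass of the bigram-LM example
# (linear embedding layer, sigmoid hidden layer, softmax output).
import numpy as np

V, d1, d2 = 20, 8, 6
rng = np.random.default_rng(1)
W1 = rng.normal(scale=0.1, size=(d1, V))
W2 = rng.normal(scale=0.1, size=(d2, d1))
W3 = rng.normal(scale=0.1, size=(V, d2))

def forward_backward(prev_word, next_word):
    z0 = np.zeros(V); z0[prev_word] = 1.0          # one-hot input
    y = np.zeros(V);  y[next_word] = 1.0           # one-hot target
    z1 = W1 @ z0                                   # linear embedding layer
    z2 = 1.0 / (1.0 + np.exp(-(W2 @ z1)))          # sigmoid hidden layer
    v3 = W3 @ z2
    z3 = np.exp(v3 - v3.max()); z3 /= z3.sum()     # softmax output
    loss = -np.log(z3[next_word])                  # cross-entropy loss

    d3 = z3 - y                                    # sigma'(v3) * delta_3 for softmax + CE
    gW3 = np.outer(d3, z2)
    d2_ = W3.T @ d3
    gW2 = np.outer(d2_ * z2 * (1 - z2), z1)
    d1_ = W2.T @ (d2_ * z2 * (1 - z2))
    gW1 = np.outer(d1_, z0)                        # layer 1 is linear: sigma'(v1) = 1
    return loss, (gW1, gW2, gW3)

loss, grads = forward_backward(prev_word=2, next_word=5)
```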
63 Summary Neural nets can improve language models: better scalability for larger N use of word similarity complex decision boundaries Training through backpropagation But we still have a Markov assumption He is from France, so it makes sense that his first language is 63
64 Recurrent neural networks
(Figure: an RNN unrolled over "the dog ...", predicting the next word at each step.)
Input: w_1, ..., w_{t-1}, w_t, w_{t+1}, ..., w_T, with each w_i ∈ R^{|V|} a one-hot vector
Model:
x_t = W^{(e)} w_t, W^{(e)} ∈ R^{d×|V|}
h_t = σ(W^{(hh)} h_{t-1} + W^{(hx)} x_t), W^{(hh)} ∈ R^{D_h×D_h}, W^{(hx)} ∈ R^{D_h×d}
ŷ_t = softmax(W^{(s)} h_t), W^{(s)} ∈ R^{|V|×D_h}
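A NumPy sketch of the forward pass only, using tanh as the nonlinearity σ (an assumption; the slide leaves σ generic) and arbitrary dimensions:

```python
# Sketch of the RNN language-model forward pass, following the equations above.
import numpy as np

V, d, Dh = 1000, 50, 64
rng = np.random.default_rng(2)
W_e  = rng.normal(scale=0.1, size=(d, V))
W_hh = rng.normal(scale=0.1, size=(Dh, Dh))
W_hx = rng.normal(scale=0.1, size=(Dh, d))
W_s  = rng.normal(scale=0.1, size=(V, Dh))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def rnn_forward(word_ids):
    """Return one predicted next-word distribution per time step."""
    h = np.zeros(Dh)
    outputs = []
    for w in word_ids:
        x = W_e[:, w]                            # x_t = W_e w_t (embedding lookup)
        h = np.tanh(W_hh @ h + W_hx @ x)         # h_t = sigma(W_hh h_{t-1} + W_hx x_t)
        outputs.append(softmax(W_s @ h))         # y_hat_t = softmax(W_s h_t)
    return outputs

preds = rnn_forward([4, 81, 7])   # one distribution over V words per position
```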
65 Recurrent neural networks
66 Recurrent neural networks
Can exploit long-range dependencies.
Each layer has the same weights (weight sharing/tying).
What is the loss function? J(θ) = ∑_{t=1}^{T} CE(y_t, ŷ_t)
67 Recurrent neural networks
Components: Model? RNN (saw before). Loss? Sum of cross-entropy over time. Optimization? SGD. Gradient computation? Backpropagation through time.
68 Training RNNs Capturing long-range dependencies with RNNs is difficult Vanishing/exploding gradients Small changes in hidden layer value in step k cause huge/minuscule changes to values in hidden layer t for t >> k
69 Explanation
Consider a simple linear RNN with no input:
h_t = W h_{t-1}
h_t = W^t h_0
h_t = (Q Λ Q^{-1})^t h_0 = Q Λ^t Q^{-1} h_0
where W = Q Λ Q^{-1} is an eigendecomposition.
Some eigenvalues will explode and some will shrink to zero.
The recurrence stretches the input in the direction of the eigenvector with the largest eigenvalue.
70 Pascanu et al., 2013: Explanation
h_t = W^{(hh)} σ(h_{t-1}) + W^{(hx)} x_t
The gradient at step t sums contributions from every earlier step k, each involving
∂h_t/∂h_k = ∏_{i=k+1}^{t} W^{(hh)T} diag(σ'(h_{i-1}))
If ‖W^{(hh)T} diag(σ'(h_{i-1}))‖ ≤ γ ≤ 1 for all i, then ‖∂h_t/∂h_k‖ ≤ γ^{t-k}:
the contribution of step k to the total gradient at step t is small.
71 Solutions
Exploding gradients: gradient clipping. Re-normalize the gradient to have norm at most C (this is no longer the true gradient). Exploding gradients are easy to detect.
Vanishing gradients: the problem is with the model! Change it (LSTMs, GRUs). Maybe later in this class.
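A sketch of clipping by global norm, where the threshold C is a hyperparameter:

```python
# Sketch of gradient clipping: if the global norm ||g|| exceeds C,
# rescale the gradients to have norm C (no longer the true gradient).
import numpy as np

def clip_gradient(grads, C=5.0):
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm > C:
        grads = [g * (C / total_norm) for g in grads]
    return grads
```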
72 Illustration Pascanu et al, 2013
73 Results Jozefowicz et al, 2016
74 Summary
Language modeling is a fundamental NLP task used in machine translation, spelling correction, speech recognition, etc.
Traditional models use n-gram counts and smoothing.
Feed-forward models take word similarity into account to generalize better.
Recurrent models can potentially learn to exploit long-range interactions.
Neural models dramatically reduced perplexity.
Recurrent networks are now used in many other NLP tasks (bidirectional RNNs, deep RNNs).