Neural Networks in Structured Prediction. November 17, 2015


HWs and Paper. The last homework is going to be posted soon: a neural net NER tagging model. This is a new structured model. Paper: Thursday after Thanksgiving (Dec 3). Bring a draft of your paper to class for discussion.

Goals for This Week's Lectures. Overview of neural networks (terminology, basic architectures, learning). Neural networks in structured prediction: Option 1: locally nonlinear factors in globally linear models; Option 2: operation sequence models; Option 3: global, nonlinear structured models [speculative].

Neural Nets: Big Ideas. Nonlinear function classes. Learning via backpropagation of errors. Neural networks as feature inducers: in ŷ = Wx + b, what features should we use? The multitask hypothesis.

Recall: Parameters. We want to condition on lots of information, but recall that p_{X,Y}(x, y) = p_{X|Y}(x | y) p_Y(y) requires O(|X||Y| + |Y|) = O(|X||Y|) parameters. Neural networks let us learn arbitrary joint interactions but control the number of parameters: control overfitting/improve generalization, and reduce memory requirements.

When to use them? If you want to condition on a lot of information (entire documents, entire sentences, ...) but you don't know what features are useful, neural networks are great!

Neurons. [diagram: inputs x_1, ..., x_5 (the vector x) with weights w_1, ..., w_5 (the vector w) feeding a single output unit y]
y = w · x
y = w · x + b (b is a bias term; it can also be encoded by forcing one of the inputs to always be 1)
y = g(w · x + b) (g is a nonlinear function that usually has a sigmoid shape)
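A minimal sketch of the neuron computation above in NumPy. The slides only say g is "sigmoid-shaped"; the logistic sigmoid used here is just one common choice, and the input and weight values are made up.

```python
import numpy as np

def neuron(x, w, b):
    """Single neuron: y = g(w . x + b), with g chosen here as the logistic sigmoid."""
    g = lambda u: 1.0 / (1.0 + np.exp(-u))   # one common "sigmoid-shaped" nonlinearity
    return g(np.dot(w, x) + b)

# Example with five inputs, as in the diagram.
x = np.array([1.0, 0.0, 2.0, -1.0, 0.5])
w = np.array([0.3, -0.2, 0.1, 0.4, -0.5])
print(neuron(x, w, b=0.1))
```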

Neural Networks. [diagram: inputs x_1, ..., x_5 connected by a weight matrix W (entries such as w_{4,1}) to outputs y_1, y_2, y_3]
y = Wx + b
y = g(Wx + b)
Softmax: g(u)_i = exp(u_i) / Σ_{i'} exp(u_{i'})
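A sketch of one full layer y = g(Wx + b) with g chosen as the softmax from the slide; the sizes (5 inputs, 3 outputs) mirror the diagram, and the random values are illustrative.

```python
import numpy as np

def softmax(u):
    """g(u)_i = exp(u_i) / sum_i' exp(u_i'), computed stably."""
    u = u - np.max(u)
    e = np.exp(u)
    return e / e.sum()

def layer(x, W, b):
    """One layer of the network: y = g(Wx + b)."""
    return softmax(W @ x + b)

rng = np.random.default_rng(0)
x = rng.normal(size=5)            # five inputs, as in the diagram
W = rng.normal(size=(3, 5))       # three outputs y_1, y_2, y_3
b = np.zeros(3)
y = layer(x, W, b)
print(y, y.sum())                 # entries are positive and sum to 1
```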

Deep. [diagram: inputs x, first layer y (weights W), second layer z = (z_1, ..., z_4) (weights V)]
z = g(Vy + c)
z = g(V h(Wx + b) + c)
Note: if g(x) = h(x) = x, then z = V(Wx + b) + c = (VW)x + (Vb + c) = Ux + d, i.e., the network collapses to a linear model.
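A sketch of the two-layer computation z = g(V h(Wx + b) + c), plus a numerical check of the note above: with identity nonlinearities the composition collapses to a single linear map Ux + d. All shapes and values here are made up for illustration.

```python
import numpy as np

def two_layer(x, W, b, V, c, g=np.tanh, h=np.tanh):
    """z = g(V h(Wx + b) + c)."""
    return g(V @ h(W @ x + b) + c)

rng = np.random.default_rng(1)
x = rng.normal(size=5)
W, b = rng.normal(size=(3, 5)), rng.normal(size=3)
V, c = rng.normal(size=(4, 3)), rng.normal(size=4)

identity = lambda u: u
z_linear = two_layer(x, W, b, V, c, g=identity, h=identity)

# Collapse: V(Wx + b) + c = (VW)x + (Vb + c) = Ux + d
U, d = V @ W, V @ b + c
print(np.allclose(z_linear, U @ x + d))   # True: no extra expressive power
```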

Predicting Discrete Objects. [diagram: the same two-layer network, with output units z_1, ..., z_4 corresponding to discrete labels DT, NN, VB, JJ] Softmax: g(u)_i = exp(u_i) / Σ_{i'} exp(u_{i'})

More Concise. h = tanh(Wx + b), y = Vh + a. [diagram: x, via W, to h, via V, to y]

Feature Induction. ŷ = Wx + b: what features should we use? With L = Σ_i ||ŷ_i − y_i||², the goal in linear regression is to learn W such that L is minimized on a training set. Instead, let ŷ = W g(Vr + c) + b, where g(Vr + c) plays the role of x: compute features of the naive representation r, put the result in x, and learn to extract features that produce linear separability.

Feature Induction. ŷ = W g(Vr + c) + b, where g(Vr + c) plays the role of x. What can this function represent? If x is big enough, this can represent any function! This is obviously much more powerful than a linear model.

Joint Interactions?

Feature Induction. Neural networks let us use a fixed number of parameters and avoid the curse of dimensionality! But they let us learn interactions (conjunctions), if it is helpful to do so. Let the data decide!

Training. Neural networks are supervised: we need pairs of inputs and outputs. For LM: input = history, output = word. Also needed: a differentiable loss function L(x, y) → R₊. Losses sum over training instances. Any of our favorite losses work here.

Loss Functions. Squared Error (compare two vectors): L = ½ ||ŷ − y||². Cross Entropy / Log Loss: L = −log p(y | x).
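Sketches of the two losses, assuming ŷ is a predicted vector (squared error) or a predicted probability distribution over labels indexed by position (log loss); the example numbers are arbitrary.

```python
import numpy as np

def squared_error(y_hat, y):
    """L = 1/2 * ||y_hat - y||^2."""
    return 0.5 * np.sum((y_hat - y) ** 2)

def log_loss(p, y_index):
    """L = -log p(y | x), where p is the model's distribution over labels."""
    return -np.log(p[y_index])

print(squared_error(np.array([0.9, 0.1]), np.array([1.0, 0.0])))
print(log_loss(np.array([0.7, 0.2, 0.1]), y_index=0))
```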

Stochastic Gradient Descent. For i = 1, 2, ...: pick a random training example t and compute g^(i) = ∂L(x_t, y_t)/∂θ evaluated at θ = θ^(i); then update θ^(i+1) = θ^(i) − η g^(i).
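A sketch of the SGD loop above, assuming a `grad(theta, x, y)` function that returns ∂L(x_t, y_t)/∂θ (e.g., computed by backpropagation, as discussed next). The step size `eta`, the toy least-squares objective, and all values are my own choices for illustration.

```python
import numpy as np

def sgd(theta, data, grad, eta=0.1, num_steps=1000, seed=0):
    """theta^(i+1) = theta^(i) - eta * dL(x_t, y_t)/dtheta, for a randomly picked example t."""
    rng = np.random.default_rng(seed)
    for _ in range(num_steps):
        x_t, y_t = data[rng.integers(len(data))]   # pick a random training example
        theta = theta - eta * grad(theta, x_t, y_t)
    return theta

# Toy usage: least-squares regression, L = 1/2 (theta . x - y)^2, so dL/dtheta = (theta . x - y) x.
data = [(np.array([1.0, x]), 2.0 * x + 1.0) for x in np.linspace(-1, 1, 20)]
theta = sgd(np.zeros(2), data, grad=lambda th, x, y: (th @ x - y) * x)
print(theta)   # approaches [1, 2]
```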

Computing Derivatives. Training neural networks involves computing derivatives. The standard algorithm for doing this is called backpropagation. It is a variant of automatic differentiation (technically, reverse-mode AD).

Automatic Differentiation? A compiler translates a function into a sequence of small operations. Every small operation (of a differentiable function) is itself differentiable. The chain rule tells us how to compute the derivatives of composite functions using the derivatives of the pieces they are composed of.

Let's start with a really simple example: y = log sin² x. What is the derivative at x₀?
Components (each R → R): y = f(u) = log u; u = g(v) = v²; v = h(x) = sin x.
Differentials (each in R): dy/du = 1/u; du/dv = 2v; dv/dx = cos x.
Chain rule: dy/dx |_{x=x₀} = dy/du |_{u=g(h(x₀))} · du/dv |_{v=h(x₀)} · dv/dx |_{x=x₀}.
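A quick numerical check of this chain-rule computation (my own addition): analytically dy/dx = (1/sin²x)(2 sin x)(cos x) = 2 cos x / sin x, and a central finite difference at an arbitrary x₀ (chosen so sin x₀ ≠ 0) agrees.

```python
import math

def y(x):
    return math.log(math.sin(x) ** 2)

def dydx(x):
    # chain rule: (1/u) * (2v) * cos(x), with u = v**2 and v = sin(x)
    v = math.sin(x)
    u = v ** 2
    return (1.0 / u) * (2.0 * v) * math.cos(x)   # simplifies to 2*cos(x)/sin(x)

x0, eps = 1.3, 1e-6
finite_diff = (y(x0 + eps) - y(x0 - eps)) / (2 * eps)
print(dydx(x0), finite_diff)   # both equal 2*cos(x0)/sin(x0)
```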

In general, for our applications, x in f(x) will be a vector. Example: y = Σ_{i=1}^{n} (W exp x)_i, where x ∈ R^d and W ∈ R^{n×d}.
Components: y = f(u) = Σ_{i=1}^{n} u_i (R^n → R); u = g(v) = Wv (R^d → R^n); v = h(x) = exp x (R^d → R^d).
Differentials: ∂y/∂u = 1ᵀ ∈ R^{1×n}; ∂u/∂v = W ∈ R^{n×d}; ∂v/∂x = diag(exp x) ∈ R^{d×d}.
Chain rule: dy/dx |_{x=x₀} = ∂y/∂u |_{u=g(h(x₀))} · ∂u/∂v |_{v=h(x₀)} · ∂v/∂x |_{x=x₀}, with the three factors in R^{1×n}, R^{n×d}, and R^{d×d}, evaluated at g(h(x₀)) ∈ R^n, h(x₀) ∈ R^d, and x₀ ∈ R^d.
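A NumPy sketch of the same vector chain rule, checking the product of the three Jacobians against a finite difference; n, d, and the entries of W and x₀ are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 4, 3
W = rng.normal(size=(n, d))
x0 = rng.normal(size=d)

def y(x):
    return np.sum(W @ np.exp(x))          # y = sum_i (W exp(x))_i

# Evaluation points and Jacobians from the table above.
v = np.exp(x0)                            # v = h(x0), in R^d
u = W @ v                                 # u = g(h(x0)), in R^n
dy_du = np.ones((1, n))                   # 1 x n
du_dv = W                                 # n x d
dv_dx = np.diag(v)                        # d x d

grad = dy_du @ du_dv @ dv_dx              # 1 x d row vector

# Finite-difference check of one coordinate.
eps = 1e-6
e0 = np.zeros(d); e0[0] = eps
print(grad[0, 0], (y(x0 + e0) - y(x0 - e0)) / (2 * eps))
```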

Two Evaluation Strategies. The chain-rule product dy/dx |_{x=x₀} = ∂y/∂u · ∂u/∂v · ∂v/∂x, with factors in R^{1×n}, R^{n×d}, and R^{d×d}, can be evaluated in two orders. Forward: multiply from the input side first, forming (∂u/∂v)(∂v/∂x) and then applying ∂y/∂u. Backward (or adjoint): multiply from the output side first, forming (∂y/∂u)(∂u/∂v) and then applying it to ∂v/∂x.
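A small numerical illustration of the two orders (my own addition, not from the slides): both give the same gradient, but with a scalar output the backward order only ever forms row vectors, while the forward order materializes an n × d intermediate first.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 4, 3
W = rng.normal(size=(n, d))
x0 = rng.normal(size=d)

dy_du = np.ones((1, n))          # 1 x n
du_dv = W                        # n x d
dv_dx = np.diag(np.exp(x0))      # d x d

backward = (dy_du @ du_dv) @ dv_dx    # only 1 x d intermediates
forward = dy_du @ (du_dv @ dv_dx)     # forms an n x d intermediate first
print(np.allclose(backward, forward))                  # True: same gradient either way
print((dy_du @ du_dv).shape, (du_dv @ dv_dx).shape)    # (1, 3) vs (4, 3)
```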

Learning with Backprop.
Training data:
x1 x2 y*
0  0  0
1  0  1
0  1  1
1  1  0
Models: y = V(Wx + b) + c and y = V tanh(Wx + b) + c.
Objective: L = (y(x) − y*)².

Learning with Backprop. [plot: squared error |y − y*|² versus training iteration for the linear model y = V(Wx + b) + c and for y = V tanh(Wx + b) + c; the linear model cannot fit this (XOR) data, so its error plateaus, while the tanh model's error goes toward zero]
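A sketch of training the tanh model on the XOR data above with backpropagation and SGD on the squared-error objective. The hidden size, learning rate, number of passes, and initialization are my own choices, not from the slides.

```python
import numpy as np

# XOR training data from the slide.
X = np.array([[0., 0.], [1., 0.], [0., 1.], [1., 1.]])
Y = np.array([0., 1., 1., 0.])

rng = np.random.default_rng(0)
H = 8                                        # hidden size (illustrative)
W, b = rng.normal(scale=0.5, size=(H, 2)), np.zeros(H)
V, c = rng.normal(scale=0.5, size=H), 0.0
eta = 0.1

for epoch in range(5000):
    for x, y_star in zip(X, Y):
        # Forward pass: y = V tanh(Wx + b) + c, loss L = (y - y*)^2.
        h = np.tanh(W @ x + b)
        y = V @ h + c
        # Backward pass (chain rule, by hand).
        dL_dy = 2.0 * (y - y_star)
        dL_dV = dL_dy * h
        dL_dc = dL_dy
        dL_dpre = dL_dy * V * (1.0 - h ** 2)     # back through tanh
        dL_dW = np.outer(dL_dpre, x)
        dL_db = dL_dpre
        # SGD update.
        W -= eta * dL_dW; b -= eta * dL_db
        V -= eta * dL_dV; c -= eta * dL_dc

print([float(V @ np.tanh(W @ x + b) + c) for x in X])   # approximately [0, 1, 1, 0]
```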

Concept: Word Embedding Represent each word in a vocabulary as a vector Vectors are learned as parameters Words that behave similarly in the model will usually end up with similar embeddings
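A minimal sketch of a word-embedding table: one learned vector per vocabulary word, stored as a row of a parameter matrix. The vocabulary, dimension, and initialization are illustrative.

```python
import numpy as np

vocab = ["<start>", "<stop>", "John", "books", "flights"]
word_to_id = {w: i for i, w in enumerate(vocab)}

rng = np.random.default_rng(0)
dim = 8
# Embedding matrix: one row per word, trained like any other parameter.
E = rng.normal(scale=0.1, size=(len(vocab), dim))

def embed(word):
    """Look up the vector for a word; gradients flow only into the row that was used."""
    return E[word_to_id[word]]

print(embed("books").shape)   # (8,)
```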

A simple tagging model. [diagram: hidden states h_1, h_2, h_3 over the sentence, predicting the tags NNP, VBZ, NNS for "John books flights"]
x = (START, John, books, flights, STOP)
h_t = tanh(C_{−1} x_{t−1} + C_0 x_t + C_{+1} x_{t+1} + b)
p(y_i = y | x, i) = exp(w_y · h_i + a_y) / Σ_{y'∈Y} exp(w_{y'} · h_i + a_{y'})
p(y | x) = Π_{i=1}^{|y|} p(y_i | x, i)
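A sketch of this window-based tagger's forward pass for one (untrained) sentence. The tag set, hidden size, embedding dimension, and random parameter values are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
tags = ["NNP", "VBZ", "NNS", "DT", "JJ"]
dim, hidden = 8, 16

# Word vectors for the padded sentence x = (START, John, books, flights, STOP).
sentence = ["<start>", "John", "books", "flights", "<stop>"]
X = {w: rng.normal(scale=0.1, size=dim) for w in sentence}

# Parameters: C_{-1}, C_0, C_{+1}, b, and output weights w_y, a_y for each tag y.
C_m1, C_0, C_p1 = (rng.normal(scale=0.1, size=(hidden, dim)) for _ in range(3))
b = np.zeros(hidden)
Wout = rng.normal(scale=0.1, size=(len(tags), hidden))   # rows are w_y
a = np.zeros(len(tags))                                  # entries are a_y

def softmax(u):
    e = np.exp(u - u.max())
    return e / e.sum()

# Tag the interior words; each decision is an independent classification.
for t in range(1, len(sentence) - 1):
    h_t = np.tanh(C_m1 @ X[sentence[t - 1]] + C_0 @ X[sentence[t]]
                  + C_p1 @ X[sentence[t + 1]] + b)
    p = softmax(Wout @ h_t + a)          # p(y_t = y | x, t) for every tag y
    print(sentence[t], tags[int(np.argmax(p))])
```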

Simple Tagging Model. Parameters: word embeddings, C_{−1}, C_0, C_{+1}, W, b. Upsides of this model: decisions are independent, so likelihood/decoding is cheap (no Viterbi!). Downsides of this model: say that (A B) and (B A) are both good taggings according to the observations, but (A A) and (B B) are bad taggings; independence hurts us! Limited context window.

Simple Tagging Model. In short: this is not a structured model, it is a bunch of independent classification decisions. But neural networks are really powerful learners; maybe we don't really need as much structure? How can we address the finite horizon problem?

Revised Tagging Model. [diagram: hidden states h_1, h_2, h_3 over x = (START, John, books, flights, STOP), predicting NNP, VBZ, NNS, with additional connections among the hidden states]

Recurrent Neural Networks (RNNs). h_t = f(h_{t−1}, x_t); c = RNN(x). [diagram: a chain of hidden states starting from an initial zero state, over inputs x_1, x_2, x_3, x_4, with the final state giving c] What is a vector representation of a sequence? Note: numerous definitions exist for f.
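A sketch of one common choice for f (an Elman-style recurrence); the slide only states h_t = f(h_{t−1}, x_t) and that many definitions of f exist, so the particular form, sizes, and values below are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, hidden = 8, 16
W_h = rng.normal(scale=0.1, size=(hidden, hidden))
W_x = rng.normal(scale=0.1, size=(hidden, dim))
b = np.zeros(hidden)

def f(h_prev, x_t):
    """One Elman-style choice of f: h_t = tanh(W_h h_{t-1} + W_x x_t + b)."""
    return np.tanh(W_h @ h_prev + W_x @ x_t + b)

def rnn(xs):
    """c = RNN(x): run f over the sequence starting from h_0 = 0; return the final state."""
    h = np.zeros(hidden)
    for x_t in xs:
        h = f(h, x_t)
    return h

xs = [rng.normal(size=dim) for _ in range(4)]   # x_1, ..., x_4
print(rnn(xs).shape)                            # a fixed-size vector for a variable-length sequence
```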

Bidirectional RNN Tagging Model. [diagram: two RNNs, each initialized with a zero state, run left-to-right and right-to-left over x = (START, John, books, flights, STOP); their hidden states at each position are used to predict the tags NNP, VBZ, NNS]
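A sketch of the bidirectional tagger: one RNN runs left to right, another right to left, and the pair of hidden states at each position feeds an independent softmax over tags. Combining the two states by concatenation, and all shapes and values, are assumptions of this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, hidden, n_tags = 8, 16, 5

def make_rnn():
    W_h = rng.normal(scale=0.1, size=(hidden, hidden))
    W_x = rng.normal(scale=0.1, size=(hidden, dim))
    def step(h, x):
        return np.tanh(W_h @ h + W_x @ x)
    return step

fwd, bwd = make_rnn(), make_rnn()
Wout = rng.normal(scale=0.1, size=(n_tags, 2 * hidden))   # reads the concatenated states

def softmax(u):
    e = np.exp(u - u.max())
    return e / e.sum()

def tag(xs):
    # Forward and backward state sequences, both started from zero vectors.
    h = np.zeros(hidden); fwd_states = []
    for x in xs:
        h = fwd(h, x); fwd_states.append(h)
    h = np.zeros(hidden); bwd_states = []
    for x in reversed(xs):
        h = bwd(h, x); bwd_states.append(h)
    bwd_states.reverse()
    # Still a bunch of independent classification decisions, one per position.
    return [int(np.argmax(softmax(Wout @ np.concatenate([hf, hb]))))
            for hf, hb in zip(fwd_states, bwd_states)]

xs = [rng.normal(size=dim) for _ in range(5)]   # embeddings for (START, John, books, flights, STOP)
print(tag(xs))
```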

BiRNN Tagging Model. In short: this is still not a structured model, it is a bunch of independent classification decisions. But neural networks are really powerful learners; maybe we don't really need as much structure? This POS tagger is currently state-of-the-art! Maybe POS tagging doesn't need that much structure.

Next time: recurrent models for adding statistical dependencies among output variables.