Neural networks CMSC 723 / LING 723 / INST 725 MARINE CARPUAT marine@cs.umd.edu Slides credit: Graham Neubig

Outline: perceptron (recap and limitations); neural networks: multi-layer perceptron; forward propagation for prediction; back propagation for learning weights; multiclass prediction.

Prediction Problems: given x, predict y.

Our prediction problem: given an introductory sentence in Wikipedia, predict whether the article is about a person.

Formalizing binary prediction

Example feature functions: unigram features, i.e. the number of times a particular word appears in x.
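For example, a minimal sketch of such a unigram feature function, in the spirit of the create_features helper the training pseudocode uses later; the name and the "UNI:" key prefix are assumptions rather than anything fixed by the slides:

from collections import defaultdict

def create_features(x):
    phi = defaultdict(int)
    for word in x.split():
        phi["UNI:" + word] += 1      # count how often each word appears in x
    return phi

print(create_features("A site , located in Maizuru , Kyoto"))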

Calculating the weighted sum. Example input: x = "A site, located in Maizuru, Kyoto". Unigram features: φ_unigram"A"(x) = 1, φ_unigram"site"(x) = 1, φ_unigram"located"(x) = 1, φ_unigram"Maizuru"(x) = 1, φ_unigram","(x) = 2, φ_unigram"in"(x) = 1, φ_unigram"Kyoto"(x) = 1, φ_unigram"priest"(x) = 0, φ_unigram"black"(x) = 0. Weights: w_unigram"a" = 0, w_unigram"site" = -3, w_unigram"located" = 0, w_unigram"Maizuru" = 0, w_unigram"," = 0, w_unigram"in" = 0, w_unigram"Kyoto" = 0, w_unigram"priest" = 2, w_unigram"black" = 0. Weighted sum: 1·0 + 1·(-3) + 1·0 + 1·0 + 2·0 + 1·0 + 1·0 + 0·2 + 0·0 = -3.

The Perceptron: a machine to calculate a weighted sum of the features and take its sign: y = sign( Σ_{i=1}^{I} w_i · φ_i(x) ). [Figure: the feature values of the example above feeding through the weights into a single sign unit.]

The perceptron: an implementation.

def predict_one(w, phi):
    score = 0
    for name, value in phi.items():  # score = w · φ(x)
        if name in w:
            score += value * w[name]
    return 1 if score >= 0 else -1

The same prediction with numpy, when w and φ(x) are dense vectors:

def predict_one(w, phi):
    score = np.dot(w, phi)           # score = w · φ(x)
    return 1 if score >= 0 else -1
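As a quick check, the dictionary-based predict_one can be run on the weights and feature counts from the Maizuru example above (the variable names are just for illustration):

w = {"site": -3, "priest": 2}        # all other weights are 0 and can be left out
phi = {"A": 1, "site": 1, "located": 1, "Maizuru": 1, ",": 2, "in": 1, "Kyoto": 1}
print(predict_one(w, phi))           # weighted sum = -3, so this prints -1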

The Perceptron: geometric interpretation. [Figure: O and X examples plotted in feature space, separated by the linear decision boundary defined by the weights.]

Outline: perceptron (recap and limitations); neural networks: multi-layer perceptron; forward propagation for prediction; back propagation for learning weights; multiclass prediction.

Exercise: sentiment analysis for movie reviews.

Limitation of the perceptron: it can only find linear separations between positive and negative examples. [Figure: four points in an X O / O X arrangement that no single line can separate.]

Neural Networks: connect together multiple perceptrons. [Figure: the unigram features of the example sentence feeding into a network of connected units.] Motivation: they can represent non-linear functions!

Neural Networks: key terms. [Figure: the same network with its parts labeled.] Input (aka features), output, nodes, layers, activation function (non-linear), multi-layer perceptron.

Example: create two classifiers. [Figure: four training points x_1, ..., x_4 plotted in the φ_0[0]-φ_0[1] plane in an X O / O X pattern, together with two perceptron units, φ_1[0] = sign(w_0,0 · φ_0 + b_0,0) and φ_1[1] = sign(w_0,1 · φ_0 + b_0,1).]

Example: these classifiers map the points to a new space. [Figure: the four examples shown both in the original φ_0[0]-φ_0[1] plane and in the new φ_1[0]-φ_1[1] plane defined by the two classifiers' outputs.]

Example: in the new space, the examples are linearly separable! [Figure: a single line in the φ_1[0]-φ_1[1] plane separates the X examples from the O examples; its output is φ_2[0] = y.]

Example: the final net. [Figure: a two-layer network; the inputs φ_0[0] and φ_0[1] feed two tanh units producing φ_1[0] and φ_1[1], which in turn feed a final tanh unit producing φ_2[0].]

Calculating a Net (with Vectors). The input is a 2-dimensional vector φ_0. Each first-layer unit has its own weight vector and bias (w_0,0, b_0,0 and w_0,1, b_0,1, each an np.array), so the first-layer output is computed one unit at a time:
φ_1 = np.zeros(2)
φ_1[0] = np.tanh(np.dot(w_0,0, φ_0) + b_0,0)[0]
φ_1[1] = np.tanh(np.dot(w_0,1, φ_0) + b_0,1)[0]
The second-layer output uses w_1,0 and b_1,0:
φ_2 = np.zeros(1)
φ_2[0] = np.tanh(np.dot(w_1,0, φ_1) + b_1,0)[0]

Calculating a Net (with Matrices). Stacking the per-unit weight vectors into one matrix per layer (and the biases into one vector per layer) lets each layer be computed in a single step. First layer: w_0 is a 2x2 matrix with one row per hidden unit, b_0 a length-2 bias vector, and
φ_1 = np.tanh(np.dot(w_0, φ_0) + b_0)
Second layer: w_1 is a 1x2 matrix, b_1 a length-1 bias vector, and
φ_2 = np.tanh(np.dot(w_1, φ_1) + b_1)
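To make this concrete, here is a minimal runnable sketch of the two-layer tanh net on the XOR-style points from the example. The specific weight and bias values (all ±1) are assumptions chosen to reproduce the behaviour the slides describe, not values read off the slides:

import numpy as np

# Assumed hand-set parameters for the XOR-style example (not taken from the slides)
w0 = np.array([[ 1.0,  1.0],    # hidden unit 0: fires for inputs near ( 1,  1)
               [-1.0, -1.0]])   # hidden unit 1: fires for inputs near (-1, -1)
b0 = np.array([-1.0, -1.0])
w1 = np.array([[1.0, 1.0]])     # output unit: positive if either hidden unit fires
b1 = np.array([1.0])

for phi0 in ([1, 1], [-1, -1], [1, -1], [-1, 1]):
    phi0 = np.array(phi0, dtype=float)
    phi1 = np.tanh(np.dot(w0, phi0) + b0)   # first-layer output
    phi2 = np.tanh(np.dot(w1, phi1) + b1)   # second-layer output
    print(phi0, "->", phi2[0])              # positive for (1,1) and (-1,-1), negative otherwise

With these values the two classes in the X O / O X pattern get opposite signs, which is exactly the non-linear separation a single perceptron could not produce.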

Forward Propagation Code:

def forward_nn(network, phi0):
    phi = [phi0]                     # output of each layer, starting with the input features
    for i in range(len(network)):
        w, b = network[i]
        # calculate this layer's value from the previous layer's output
        phi.append(np.tanh(np.dot(w, phi[i]) + b))
    return phi                       # return the values of all layers
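For instance, packaging the assumed parameters from the sketch above as a list of (weight matrix, bias) pairs gives a network that forward_nn can run directly (again, the parameter values are assumptions):

network = [(w0, b0), (w1, b1)]                 # the assumed XOR-style parameters from above
phis = forward_nn(network, np.array([1.0, 1.0]))
print(phis[-1])                                # final-layer output; positive for input (1, 1)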

Outline: perceptron (recap and limitations); neural networks: multi-layer perceptron; forward propagation for prediction; back propagation for learning weights; multiclass prediction.

The perceptron: An online learning algorithm

Perceptron weight update: if y = 1, increase the weights for the features in φ(x); if y = -1, decrease the weights for the features in φ(x).
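A minimal sketch of that update for the dictionary-based feature representation used earlier; the helper name perceptron_update is illustrative, and the update is applied when the current prediction is wrong:

def perceptron_update(w, phi, y):
    # y is +1 or -1: add or subtract each feature value from its weight
    for name, value in phi.items():
        w[name] = w.get(name, 0) + y * value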

Stochastic gradient ascent (or descent): an online training algorithm for logistic regression and other probabilistic models. Update the weights for every training example, moving in the direction given by the gradient, with the size of the update step scaled by the learning rate.

Calculating Error with tanh. Error (aka loss) function: squared error, err = (y' - y)^2 / 2, where y' is the correct answer and y is the net output. Gradient of the error: err' = δ = y' - y. Update of the weights: w ← w + λ δ φ(x), where λ is the learning rate.
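As a sketch, the same update written out for a single output unit, with illustrative toy numbers (the function name is mine, not the slides'):

import numpy as np

def update_single_unit(w, phi, y_correct, y_output, lam):
    delta = y_correct - y_output          # err' = δ = y' - y
    return w + lam * delta * phi          # w <- w + λ δ φ(x)

w = np.zeros(3)
phi = np.array([1.0, 2.0, 0.0])
y_output = np.tanh(np.dot(w, phi))        # current net output for this example
w = update_single_unit(w, phi, y_correct=1.0, y_output=y_output, lam=0.1)
print(w)                                  # [0.1, 0.2, 0.0]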

Problem: we don't know the error for hidden layers! The network only receives the correct label y' for the final layer. [Figure: the example network; the hidden-layer nodes are marked y' = ?, and only the output node has a known y'.]

Solution: Back Propagation. Propagate the error backwards through the layers: each node j receives the weighted sum of the errors δ_i of the nodes i it feeds into, Σ_i δ_i w_{j,i}. [Figure: a node j connected to downstream nodes i with weights w_{j,i} and errors δ_i; a plot of tanh and its derivative.] Also take into account the gradient of the non-linear function: d tanh(φ(x) · w) / d(φ(x) · w) = 1 - tanh(φ(x) · w)^2 = 1 - y_j^2. Together: δ_j = (1 - y_j^2) Σ_i δ_i w_{j,i}.
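For reference, a short chain-rule derivation of that combined rule, written in LaTeX; the pre-activation notation a_j is an addition here rather than the slides' own:

% y_j = tanh(a_j), and the next layer's pre-activations are a_i = \sum_j w_{j,i} y_j.
% Defining \delta_j as the negative gradient of the error with respect to a_j:
\delta_j = -\frac{\partial\,\mathrm{err}}{\partial a_j}
         = \sum_i \left(-\frac{\partial\,\mathrm{err}}{\partial a_i}\right)
                  \frac{\partial a_i}{\partial y_j}\,
                  \frac{\partial y_j}{\partial a_j}
         = \Bigl(\sum_i \delta_i\, w_{j,i}\Bigr)\bigl(1 - y_j^2\bigr)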

Back Propagation for the example net. Error of the output: δ_2 = np.array([y' - y]). Error of the first layer: δ'_2 = δ_2 * (1 - φ_2**2), then δ_1 = np.dot(δ'_2, w_1). Error of the 0th layer: δ'_1 = δ_1 * (1 - φ_1**2), then δ_0 = np.dot(δ'_1, w_0).

Back Propagation Code:

def backward_nn(net, phi, y_correct):
    J = len(net)
    delta = [np.zeros(1) for _ in range(J)] + [np.array([y_correct - phi[J][0]])]  # length J+1
    delta_p = [np.zeros(1) for _ in range(J + 1)]            # delta_p plays the role of δ'
    for i in range(J - 1, -1, -1):
        delta_p[i + 1] = delta[i + 1] * (1 - phi[i + 1] ** 2)  # gradient of tanh
        w, b = net[i]
        delta[i] = np.dot(delta_p[i + 1], w)                   # propagate the error backwards
    return delta_p

Updating Weights. Finally, use the error to update the weights. The gradient for the weight matrix w_i is the outer product of the next layer's δ' and the previous layer's φ: -derr/dw_i = np.outer(δ'_{i+1}, φ_i). Multiply by the learning rate and update the weights: w_i += λ * -derr/dw_i. For the bias the input is 1, so simply -derr/db_i = δ'_{i+1} and b_i += λ * -derr/db_i.

Weight Update Code:

def update_weights(net, phi, delta_p, lam):
    for i in range(len(net)):
        w, b = net[i]
        w += lam * np.outer(delta_p[i + 1], phi[i])   # gradient for the weight matrix
        b += lam * delta_p[i + 1]                     # gradient for the bias

Overall View of Learning:

# Create features, initialize weights randomly
create map ids, array feat_lab
for each labeled pair x, y in the data:
    add (create_features(x), y) to feat_lab
initialize net randomly

# Perform training
for I iterations:
    for each labeled pair φ_0, y in feat_lab:
        φ = forward_nn(net, φ_0)
        δ' = backward_nn(net, φ, y)
        update_weights(net, φ, δ', λ)

print net to weight_file
print ids to id_file
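A minimal Python sketch of that loop, assuming forward_nn, backward_nn and update_weights as given above; the random initialization scheme and the use of the XOR-style points as training data are illustrative assumptions rather than the slides' setup:

import numpy as np

def init_net(sizes, scale=0.1):
    # one (weight matrix, bias vector) pair per layer; small random init is an assumed choice
    return [(np.random.uniform(-scale, scale, (n_out, n_in)),
             np.random.uniform(-scale, scale, n_out))
            for n_in, n_out in zip(sizes[:-1], sizes[1:])]

data = [(np.array([ 1.0,  1.0]),  1),
        (np.array([-1.0, -1.0]),  1),
        (np.array([ 1.0, -1.0]), -1),
        (np.array([-1.0,  1.0]), -1)]       # XOR-style toy data (assumed)

net = init_net([2, 2, 1])                   # 2 inputs, 2 hidden units, 1 output
for iteration in range(1000):
    for phi0, y in data:
        phi = forward_nn(net, phi0)         # forward pass
        delta_p = backward_nn(net, phi, y)  # backward pass (delta_p plays the role of δ')
        update_weights(net, phi, delta_p, lam=0.1)

for phi0, y in data:
    print(phi0, y, forward_nn(net, phi0)[-1][0])   # outputs should usually match the sign of y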

Outline: perceptron (recap and limitations); neural networks: multi-layer perceptron; forward propagation for prediction; back propagation for learning weights; multiclass prediction.

Review: Prediction Problems. Given x, predict y:
- A book review ("Oh, man I love this book!" / "This book is so boring...") -> is it positive? (yes / no): binary prediction (2 choices).
- A tweet ("On the way to the park!" / "公園に行くなう!") -> its language (English / Japanese): multi-class prediction (several choices).
- A sentence ("I read a book") -> its syntactic parse (a tree over "I read a book"): structured prediction (millions of choices).

Multiclass classification with neural networks. [Figure: the example features feeding a network whose output must now choose among several classes.]

Sigmoid Function: the sigmoid softens the step function. P(y = 1 | x) = e^{w·φ(x)} / (1 + e^{w·φ(x)}). [Figure: two plots of p(y|x) against w·φ(x): the hard step function, which jumps from 0 to 1 at w·φ(x) = 0, and the smooth sigmoid, which crosses 0.5 there.]

Softmax Function for multiclass classification: the sigmoid generalized to multiple classes. P(y | x) = e^{w·φ(x, y)} / Σ_ỹ e^{w·φ(x, ỹ)}: the numerator scores the current class, and the denominator sums the scores over all classes. This can be expressed using matrix/vector operations: r = exp(W φ(x)), then p = r / Σ r (normalize r so its entries sum to one).
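A minimal numpy sketch of that matrix/vector form; the max-subtraction for numerical stability and the toy weight values are additions, not part of the slide:

import numpy as np

def softmax(W, phi):
    scores = np.dot(W, phi)          # one score per class: W φ(x)
    scores -= np.max(scores)         # subtract the max for numerical stability (added here)
    r = np.exp(scores)
    return r / np.sum(r)             # p[y] = P(y | x)

W = np.array([[1.0, -2.0, 0.5],
              [0.3,  0.8, -1.0]])    # assumed toy weights: 2 classes, 3 features
print(softmax(W, np.array([1.0, 0.0, 2.0])))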

What we've learned today: the perceptron (recap and limitations); neural networks: the multi-layer perceptron; forward propagation for prediction, implemented as matrix operations; back propagation for learning weights, via gradient descent and the chain rule; multiclass prediction.

Coming next: recurrent neural networks and their applications to NLP. Readings: Bengio et al. 2005.