Neural networks CMSC 723 / LING 723 / INST 725 MARINE CARPUAT marine@cs.umd.edu Slides credit: Graham Neubig
Outline Perceptron: recap and limitations Neural networks Multi-layer perceptron Forward propagation for prediction Back propagation for learning weights Multiclass prediction
Prediction Problems Given x, predict y
Our prediction problem Given an introductory sentence in Wikipedia, predict whether the article is about a person
Formalizing binary prediction: given an input x, predict a label y in {-1, +1}
Example feature functions: Unigram features count the number of times a particular word appears in x
Calculating the Weighted Sum
x = A site, located in Maizuru, Kyoto
Features (unigram counts): φ_A(x) = 1, φ_site(x) = 1, φ_located(x) = 1, φ_Maizuru(x) = 1, φ_,(x) = 2, φ_in(x) = 1, φ_Kyoto(x) = 1, φ_priest(x) = 0, φ_black(x) = 0
Weights: w_site = -3, w_priest = 2, all other weights = 0
Weighted sum: w·φ(x) = 0 + (-3) + 0 + 0 + 0 + 0 + 0 + 0 + 0 = -3
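As a minimal runnable sketch of this computation (pure Python; the sparse dict representation anticipates the implementation on the next slide):

    phi = {"A": 1, "site": 1, "located": 1, "Maizuru": 1, ",": 2, "in": 1, "Kyoto": 1}  # unigram counts for x
    w = {"site": -3, "priest": 2}  # all other weights are 0

    score = sum(value * w.get(name, 0) for name, value in phi.items())
    print(score)  # -3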
The Perceptron: a machine to calculate a weighted sum
The feature values φ_A(x) = 1, φ_site(x) = 1, φ_located(x) = 1, φ_Maizuru(x) = 1, φ_,(x) = 2, φ_in(x) = 1, φ_Kyoto(x) = 1, φ_priest(x) = 0, φ_black(x) = 0 are combined with the weights (0, -3, 0, 0, 0, 0, 0, 2, 0) and passed through the sign function:
    y = sign( Σ_{i=1}^{I} w_i φ_i(x) )
The perceptron: an implementation
predict_one(w, phi)
    score = 0
    for each name, value in phi        # score = w·φ(x)
        if name exists in w
            score += value * w[name]
    return (1 if score >= 0 else -1)
numpy version:
predict_one(w, phi)
    score = np.dot(w, phi)
    return (1 if score[0] >= 0 else -1)
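The numpy variant assumes the features have already been mapped into a dense vector; here is a sketch of that conversion (the ids map mirrors the "ids" map used in the training pseudocode later; the names and toy weights are my own):

    import numpy as np

    ids = {"site": 0, "priest": 1, ",": 2}   # feature name -> dimension
    def to_vector(phi_dict):
        v = np.zeros(len(ids))
        for name, value in phi_dict.items():
            if name in ids:
                v[ids[name]] = value
        return v

    w = np.array([-3.0, 2.0, 0.0])
    phi = to_vector({"site": 1, ",": 2})
    print(1 if np.dot(w, phi) >= 0 else -1)  # -1: the weight on "site" makes the score negative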
The Perceptron: Geometric interpretation
The perceptron finds a linear boundary (a hyperplane) separating the positive (O) examples from the negative (X) examples. [figure: 2D points labeled O and X with a separating line]
Outline Perceptron: recap and limitations Neural networks Multi-layer perceptron Forward propagation for prediction Back propagation for learning weights Multiclass prediction
Exercise Sentiment analysis for movie reviews
Limitation of perceptron: it can only find linear separations between positive and negative examples. [figure: XOR-like arrangement of X and O points that no single line can separate]
Neural Networks: connect together multiple perceptrons. Motivation: can represent non-linear functions! [figure: the unigram features feeding into several connected perceptron units]
Neural Networks: key terms
- Input (aka features)
- Output
- Nodes
- Layers
- Activation function (non-linear)
- Multi-layer perceptron
[figure: the same network with each term labeled]
Example: create two classifiers
Four points: X(x1) = {-1, 1}, O(x2) = {1, 1}, O(x3) = {-1, -1}, X(x4) = {1, -1}
Classifier 1: φ1[0] = sign(w0,0 · φ0 + b0,0), with w0,0 = {1, 1}, b0,0 = -1
Classifier 2: φ1[1] = sign(w0,1 · φ0 + b0,1), with w0,1 = {-1, -1}, b0,1 = -1
Example: these classifiers map to a new space
Original space: X(x1) = {-1, 1}, O(x2) = {1, 1}, O(x3) = {-1, -1}, X(x4) = {1, -1}
New (φ1) space: φ1(x1) = {-1, -1}, φ1(x2) = {1, -1}, φ1(x3) = {-1, 1}, φ1(x4) = {-1, -1}
Note that both X points map to the same point {-1, -1}.
Example: in the new space, the examples are linearly separable!
y = φ2[0] = sign(w1,0 · φ1 + b1,0), with w1,0 = {1, 1}, b1,0 = 1
This line separates the O points φ1(x2) = {1, -1} and φ1(x3) = {-1, 1} from the X points φ1(x1) = φ1(x4) = {-1, -1}.
Example: the final net
Replace sign with the tanh activation:
φ1[0] = tanh(w0,0 · φ0 + b0,0), with w0,0 = {1, 1}, b0,0 = -1
φ1[1] = tanh(w0,1 · φ0 + b0,1), with w0,1 = {-1, -1}, b0,1 = -1
φ2[0] = tanh(w1,0 · φ1 + b1,0), with w1,0 = {1, 1}, b1,0 = 1
Calculating a Net (with Vectors)
Input:
    φ0 = np.array( [1, 1] )
First Layer Output:
    w0_0 = np.array( [1, 1] );   b0_0 = np.array( [-1] )
    w0_1 = np.array( [-1, -1] ); b0_1 = np.array( [-1] )
    φ1 = np.zeros( 2 )
    φ1[0] = np.tanh( np.dot(w0_0, φ0) + b0_0 )[0]
    φ1[1] = np.tanh( np.dot(w0_1, φ0) + b0_1 )[0]
Second Layer Output:
    w1_0 = np.array( [1, 1] ); b1_0 = np.array( [1] )
    φ2 = np.zeros( 1 )
    φ2[0] = np.tanh( np.dot(φ1, w1_0) + b1_0 )[0]
Calculating a Net (with Matrices)
Input:
    φ0 = np.array( [1, 1] )
First Layer Output:
    w0 = np.array( [[1, 1], [-1, -1]] )
    b0 = np.array( [-1, -1] )
    φ1 = np.tanh( np.dot(w0, φ0) + b0 )
Second Layer Output:
    w1 = np.array( [[1, 1]] )
    b1 = np.array( [1] )
    φ2 = np.tanh( np.dot(w1, φ1) + b1 )
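To check the example end to end, here is a runnable sketch that pushes all four points through the matrix form above (the ±1 weights follow the reconstructed two-classifier example; the variable names are mine):

    import numpy as np

    w0 = np.array([[1, 1], [-1, -1]]); b0 = np.array([-1, -1])
    w1 = np.array([[1, 1]]);           b1 = np.array([1])

    for x, label in [([-1, 1], 'X'), ([1, 1], 'O'), ([-1, -1], 'O'), ([1, -1], 'X')]:
        phi1 = np.tanh(np.dot(w0, x) + b0)     # hidden layer
        phi2 = np.tanh(np.dot(w1, phi1) + b1)  # output layer
        print(label, phi2[0])                  # positive for O, negative for X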
Forward Propagation Code
forward_nn(network, φ0)
    φ = [ φ0 ]                         # output of each layer
    for each layer i in 0 .. len(network)-1:
        w, b = network[i]
        # calculate this layer's value based on the previous layer
        φ[i+1] = np.tanh( np.dot( w, φ[i] ) + b )
    return φ                           # return the values of all layers
Outline Perceptron: recap and limitations Neural networks Multi-layer perceptron Forward propagation for prediction Back propagation for learning weights Multiclass prediction
The perceptron: An online learning algorithm
Perceptron weight update
If y = 1, increase the weights for features in x
If y = -1, decrease the weights for features in x
(applied only when the current model misclassifies x: w ← w + y φ(x))
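A minimal runnable sketch of this online algorithm, assuming dict-based sparse features as in predict_one earlier (the mistake-driven loop is the standard perceptron; the toy data is made up):

    def train_perceptron(data, iterations=10):
        # Online perceptron: update weights only on misclassified examples
        w = {}
        for _ in range(iterations):
            for phi, y in data:  # y is +1 or -1
                score = sum(value * w.get(name, 0) for name, value in phi.items())
                prediction = 1 if score >= 0 else -1
                if prediction != y:  # mistake: move weights toward the correct label
                    for name, value in phi.items():
                        w[name] = w.get(name, 0) + y * value
        return w

    data = [({"love": 1}, 1), ({"boring": 1}, -1)]
    print(train_perceptron(data))  # {"boring": -1}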
Stochastic gradient ascent (or descent) Online training algorithm for logistic regression and other probabilistic models Update weights for every training example Move in direction given by gradient Size of update step scaled by learning rate
Calculating Error with tanh
Error (aka loss) function: squared error
    err = (y' - y)^2 / 2
where y' is the correct answer and y is the net output.
Gradient of the error: err' = δ = y' - y
Update of weights: w ← w + λ δ φ(x), where λ is the learning rate
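As a quick worked example of this update rule (the numbers are chosen for illustration, not from the slides): suppose the correct answer is y' = 1, the net outputs y = 0.6, the learning rate is λ = 0.1, and φ(x) = (1, 2). Then δ = y' - y = 0.4, and the weights move by λ δ φ(x) = 0.1 * 0.4 * (1, 2) = (0.04, 0.08).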
Problem: we don't know the error for hidden layers!
The NN only gets the correct label for the final layer: at the output node y' is given, but at hidden nodes 1, 2, and 3 we only observe each node's output y; the correct answer y' is unknown. [figure: the example net with "y' = ?" at each hidden node and a known y' at the output]
Solution: Back Propagation
Propagate the error backwards through the layers:
    δ_j = Σ_i δ_i w_j,i
[figure: node j collecting error from the next layer, e.g. δ = -0.9 through w = 0.1, δ = 0.2 through w = 1, δ = 0.4 through w = -0.3]
Also consider the gradient of the non-linear function:
    d tanh(φ(x)·w) / d(φ(x)·w) = 1 - tanh(φ(x)·w)^2 = 1 - y_j^2
[figure: plot of tanh and its derivative]
Together:
    δ'_j = (1 - y_j^2) Σ_i δ_i w_j,i
Back Propagation (on the example net)
Error of the output:
    δ2 = np.array( [y' - y] )
Error of the first layer:
    δ'2 = δ2 * (1 - φ2**2)
    δ1 = np.dot(δ'2, w1)
Error of the 0th layer:
    δ'1 = δ1 * (1 - φ1**2)
    δ0 = np.dot(δ'1, w0)
Back Propagation Code
backward_nn(net, φ, y')
    J = len(net)
    create array δ = [ 0, 0, ..., np.array([y' - φ[J][0]]) ]   # length J+1
    create array δ' = [ 0, 0, ..., 0 ]                          # length J+1
    for i in J-1 .. 0:
        δ'[i+1] = δ[i+1] * (1 - φ[i+1]**2)
        w, b = net[i]
        δ[i] = np.dot(δ'[i+1], w)
    return δ'
Updating Weights
Finally, use the error to update the weights.
The gradient of weight w_i is the outer product of the next layer's δ' and the previous layer's φ:
    -derr/dw_i = np.outer( δ'[i+1], φ[i] )
Multiply by the learning rate and update the weights:
    w_i += λ * -derr/dw_i
For the bias, the input is 1, so the gradient is simply δ':
    -derr/db_i = δ'[i+1]
    b_i += λ * -derr/db_i
Weight Update Code
update_weights(net, φ, δ', λ)
    for i in 0 .. len(net)-1:
        w, b = net[i]
        w += λ * np.outer( δ'[i+1], φ[i] )
        b += λ * δ'[i+1]
Overall View of Learning
# Create features, initialize weights randomly
create map ids, array feat_lab
for each labeled pair x, y in the data
    add (create_features(x), y) to feat_lab
initialize net randomly
# Perform training
for I iterations
    for each labeled pair φ0, y in feat_lab
        φ = forward_nn(net, φ0)
        δ' = backward_nn(net, φ, y)
        update_weights(net, φ, δ', λ)
print net to weight_file
print ids to id_file
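Pulling the pieces together, the following self-contained sketch recasts the pseudocode above as runnable numpy and trains the two-layer net on the XOR data from the earlier example (layer sizes, initialization, learning rate, and iteration count are my choices for illustration, not the lecture's):

    import numpy as np

    def forward_nn(net, phi0):
        phi = [phi0]
        for w, b in net:
            phi.append(np.tanh(np.dot(w, phi[-1]) + b))
        return phi

    def backward_nn(net, phi, label):
        J = len(net)
        delta = [np.zeros(len(p)) for p in phi]
        delta[J] = np.array([label - phi[J][0]])   # error at the output
        delta_p = [None] * (J + 1)
        for i in range(J - 1, -1, -1):
            delta_p[i + 1] = delta[i + 1] * (1 - phi[i + 1] ** 2)  # tanh gradient
            w, b = net[i]
            delta[i] = np.dot(delta_p[i + 1], w)   # propagate backwards
        return delta_p

    def update_weights(net, phi, delta_p, lam):
        for i, (w, b) in enumerate(net):
            w += lam * np.outer(delta_p[i + 1], phi[i])
            b += lam * delta_p[i + 1]

    rng = np.random.default_rng(0)
    net = [(rng.normal(0, 0.5, (2, 2)), rng.normal(0, 0.5, 2)),
           (rng.normal(0, 0.5, (1, 2)), rng.normal(0, 0.5, 1))]
    data = [([-1, 1], -1), ([1, 1], 1), ([-1, -1], 1), ([1, -1], -1)]  # XOR

    for _ in range(1000):
        for x, y in data:
            phi = forward_nn(net, np.array(x, dtype=float))
            update_weights(net, phi, backward_nn(net, phi, y), 0.1)

    for x, y in data:
        # outputs should approach the correct signs (convergence can depend on the random init)
        print(x, y, forward_nn(net, np.array(x, dtype=float))[-1][0])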
Outline Perceptron: recap and limitations Neural networks Multi-layer perceptron Forward propagation for prediction Back propagation for learning weights Multiclass prediction
Review: Prediction Problems
Given x, predict y:
- A book review ("Oh, man I love this book!" / "This book is so boring...") → Is it positive? (yes / no): Binary Prediction (2 choices)
- A tweet ("On the way to the park!" / "公園に行くなう!" [Japanese: "On my way to the park now!"]) → Its language (English / Japanese): Multi-class Prediction (several choices)
- A sentence ("I read a book") → Its syntactic parse: Structured Prediction (millions of choices) [figure: parse tree of "I read a book" with nodes S, NP, VP, VBD, DET, NN]
Multiclass classification with neural networks: how do we get from a single network output to a choice among several classes? [figure: the unigram feature inputs feeding a network whose output is marked "?"]
Sigmoid Function
The sigmoid softens the step function:
    P(y = 1 | x) = e^(w·φ(x)) / (1 + e^(w·φ(x)))
[figures: p(y|x) plotted against w·φ(x) for the step function and for the sigmoid function]
Softmax Function for multiclass classification
The sigmoid function generalized to multiple classes:
    P(y | x) = e^(w·φ(x,y)) / Σ_ỹ e^(w·φ(x,ỹ))
The numerator scores the current class; the denominator sums over all classes ỹ.
Can be expressed using matrix/vector operations:
    r = np.exp( np.dot(W, φ(x)) )
    p = r / np.sum(r)
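A minimal numpy sketch of the softmax (the max-subtraction is a standard numerical-stability trick, an addition of mine, not from the slide):

    import numpy as np

    def softmax(scores):
        # Turn a vector of class scores W·φ(x) into probabilities
        r = np.exp(scores - np.max(scores))  # shift for numerical stability
        return r / np.sum(r)

    print(softmax(np.array([2.0, 1.0, -1.0])))  # sums to 1; the largest score gets the highest probability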
What we've learned today Perceptron: recap and limitations Neural networks Multi-layer perceptron Forward propagation for prediction (implemented as matrix operations) Back propagation for learning weights (gradient descent + chain rule) Multiclass prediction
Coming next: recurrent neural networks, applications to NLP. Readings: Bengio et al. 2005