Neural networks CMSC 723 / LING 723 / INST 725 MARINE CARPUAT marine@cs.umd.edu Slides credit: Graham Neubig
Outline Perceptron: recap and limitations Neural networks Multi-layer perceptron Forward propagation for prediction Back propagation for learning weights Multiclass prediction
Prediction Problems Given x, predict y
Our prediction problem Given an introductory sentence in Wikipedia, predict whether the article is about a person
Formalizing binary prediction: given an input x, predict a label y in {-1, +1}
Example feature functions: Unigram features count the number of times a particular word appears in x
Calculating the Weighted Sum
x = A site, located in Maizuru, Kyoto
Features (unigram counts): φ_A(x) = 1, φ_site(x) = 1, φ_located(x) = 1, φ_Maizuru(x) = 1, φ_,(x) = 2, φ_in(x) = 1, φ_Kyoto(x) = 1, φ_priest(x) = 0, φ_black(x) = 0
Weights: w_site = -3, w_priest = 2, all other weights = 0
Weighted sum: w·φ(x) = 0 + (-3) + 0 + 0 + 0 + 0 + 0 + 0 + 0 = -3
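As a minimal runnable sketch of this computation (pure Python; the sparse dict representation anticipates the implementation on the next slide):

    phi = {"A": 1, "site": 1, "located": 1, "Maizuru": 1, ",": 2, "in": 1, "Kyoto": 1}  # unigram counts for x
    w = {"site": -3, "priest": 2}  # all other weights are 0

    score = sum(value * w.get(name, 0) for name, value in phi.items())
    print(score)  # -3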
The Perceptron: a machine to calculate a weighted sum
The feature values φ_A(x) = 1, φ_site(x) = 1, φ_located(x) = 1, φ_Maizuru(x) = 1, φ_,(x) = 2, φ_in(x) = 1, φ_Kyoto(x) = 1, φ_priest(x) = 0, φ_black(x) = 0 are combined with the weights (0, -3, 0, 0, 0, 0, 0, 2, 0) and passed through the sign function:
    y = sign( Σ_{i=1}^{I} w_i φ_i(x) )
The perceptron: an implementation
predict_one(w, phi)
    score = 0
    for each name, value in phi        # score = w·φ(x)
        if name exists in w
            score += value * w[name]
    return (1 if score >= 0 else -1)
numpy version:
predict_one(w, phi)
    score = np.dot(w, phi)
    return (1 if score[0] >= 0 else -1)
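The numpy variant assumes the features have already been mapped into a dense vector; here is a sketch of that conversion (the ids map mirrors the "ids" map used in the training pseudocode later; the names and toy weights are my own):

    import numpy as np

    ids = {"site": 0, "priest": 1, ",": 2}   # feature name -> dimension
    def to_vector(phi_dict):
        v = np.zeros(len(ids))
        for name, value in phi_dict.items():
            if name in ids:
                v[ids[name]] = value
        return v

    w = np.array([-3.0, 2.0, 0.0])
    phi = to_vector({"site": 1, ",": 2})
    print(1 if np.dot(w, phi) >= 0 else -1)  # -1: the weight on "site" makes the score negative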
The Perceptron: Geometric interpretation
The perceptron finds a linear boundary (a hyperplane) separating the positive (O) examples from the negative (X) examples. [figure: 2D points labeled O and X with a separating line]
Outline Perceptron: recap and limitations Neural networks Multi-layer perceptron Forward propagation for prediction Back propagation for learning weights Multiclass prediction
Exercise Sentiment analysis for movie reviews
Limitation of perceptron: it can only find linear separations between positive and negative examples. [figure: XOR-like arrangement of X and O points that no single line can separate]
Neural Networks: connect together multiple perceptrons. Motivation: can represent non-linear functions! [figure: the unigram features feeding into several connected perceptron units]
Neural Networks: key terms
- Input (aka features)
- Output
- Nodes
- Layers
- Activation function (non-linear)
- Multi-layer perceptron
[figure: the same network with each term labeled]
Example: create two classifiers
Four points: X(x1) = {-1, 1}, O(x2) = {1, 1}, O(x3) = {-1, -1}, X(x4) = {1, -1}
Classifier 1: φ1[0] = sign(w0,0 · φ0 + b0,0), with w0,0 = {1, 1}, b0,0 = -1
Classifier 2: φ1[1] = sign(w0,1 · φ0 + b0,1), with w0,1 = {-1, -1}, b0,1 = -1
Example: these classifiers map to a new space
Original space: X(x1) = {-1, 1}, O(x2) = {1, 1}, O(x3) = {-1, -1}, X(x4) = {1, -1}
New (φ1) space: φ1(x1) = {-1, -1}, φ1(x2) = {1, -1}, φ1(x3) = {-1, 1}, φ1(x4) = {-1, -1}
Note that both X points map to the same point {-1, -1}.
Example: in the new space, the examples are linearly separable!
y = φ2[0] = sign(w1,0 · φ1 + b1,0), with w1,0 = {1, 1}, b1,0 = 1
This line separates the O points φ1(x2) = {1, -1} and φ1(x3) = {-1, 1} from the X points φ1(x1) = φ1(x4) = {-1, -1}.
Example: the final net
Replace sign with the tanh activation:
φ1[0] = tanh(w0,0 · φ0 + b0,0), with w0,0 = {1, 1}, b0,0 = -1
φ1[1] = tanh(w0,1 · φ0 + b0,1), with w0,1 = {-1, -1}, b0,1 = -1
φ2[0] = tanh(w1,0 · φ1 + b1,0), with w1,0 = {1, 1}, b1,0 = 1
Calculating a Net (with Vectors)
Input:
    φ0 = np.array( [1, 1] )
First Layer Output:
    w0_0 = np.array( [1, 1] );   b0_0 = np.array( [-1] )
    w0_1 = np.array( [-1, -1] ); b0_1 = np.array( [-1] )
    φ1 = np.zeros( 2 )
    φ1[0] = np.tanh( np.dot(w0_0, φ0) + b0_0 )[0]
    φ1[1] = np.tanh( np.dot(w0_1, φ0) + b0_1 )[0]
Second Layer Output:
    w1_0 = np.array( [1, 1] ); b1_0 = np.array( [1] )
    φ2 = np.zeros( 1 )
    φ2[0] = np.tanh( np.dot(φ1, w1_0) + b1_0 )[0]
Calculating a Net (with Matrices)
Input:
    φ0 = np.array( [1, 1] )
First Layer Output:
    w0 = np.array( [[1, 1], [-1, -1]] )
    b0 = np.array( [-1, -1] )
    φ1 = np.tanh( np.dot(w0, φ0) + b0 )
Second Layer Output:
    w1 = np.array( [[1, 1]] )
    b1 = np.array( [1] )
    φ2 = np.tanh( np.dot(w1, φ1) + b1 )
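To check the example end to end, here is a runnable sketch that pushes all four points through the matrix form above (the ±1 weights follow the reconstructed two-classifier example; the variable names are mine):

    import numpy as np

    w0 = np.array([[1, 1], [-1, -1]]); b0 = np.array([-1, -1])
    w1 = np.array([[1, 1]]);           b1 = np.array([1])

    for x, label in [([-1, 1], 'X'), ([1, 1], 'O'), ([-1, -1], 'O'), ([1, -1], 'X')]:
        phi1 = np.tanh(np.dot(w0, x) + b0)     # hidden layer
        phi2 = np.tanh(np.dot(w1, phi1) + b1)  # output layer
        print(label, phi2[0])                  # positive for O, negative for X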
Forward Propagation Code
forward_nn(network, φ0)
    φ = [ φ0 ]                         # output of each layer
    for each layer i in 0 .. len(network)-1:
        w, b = network[i]
        # calculate this layer's value based on the previous layer
        φ[i+1] = np.tanh( np.dot( w, φ[i] ) + b )
    return φ                           # return the values of all layers
Outline Perceptron: recap and limitations Neural networks Multi-layer perceptron Forward propagation for prediction Back propagation for learning weights Multiclass prediction
The perceptron: An online learning algorithm
Perceptron weight update
If y = 1, increase the weights for features in x
If y = -1, decrease the weights for features in x
(applied only when the current model misclassifies x: w ← w + y φ(x))
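A minimal runnable sketch of this online algorithm, assuming dict-based sparse features as in predict_one earlier (the mistake-driven loop is the standard perceptron; the toy data is made up):

    def train_perceptron(data, iterations=10):
        # Online perceptron: update weights only on misclassified examples
        w = {}
        for _ in range(iterations):
            for phi, y in data:  # y is +1 or -1
                score = sum(value * w.get(name, 0) for name, value in phi.items())
                prediction = 1 if score >= 0 else -1
                if prediction != y:  # mistake: move weights toward the correct label
                    for name, value in phi.items():
                        w[name] = w.get(name, 0) + y * value
        return w

    data = [({"love": 1}, 1), ({"boring": 1}, -1)]
    print(train_perceptron(data))  # {"boring": -1}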
Stochastic gradient ascent (or descent) Online training algorithm for logistic regression and other probabilistic models Update weights for every training example Move in direction given by gradient Size of update step scaled by learning rate
Calculating Error with tanh
Error (aka loss) function: squared error
    err = (y' - y)^2 / 2
where y' is the correct answer and y is the net output.
Gradient of the error: err' = δ = y' - y
Update of weights: w ← w + λ δ φ(x), where λ is the learning rate
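As a quick worked example of this update rule (the numbers are chosen for illustration, not from the slides): suppose the correct answer is y' = 1, the net outputs y = 0.6, the learning rate is λ = 0.1, and φ(x) = (1, 2). Then δ = y' - y = 0.4, and the weights move by λ δ φ(x) = 0.1 * 0.4 * (1, 2) = (0.04, 0.08).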
Problem: we don't know the error for hidden layers!
The NN only gets the correct label for the final layer: at the output node y' is given, but at hidden nodes 1, 2, and 3 we only observe each node's output y; the correct answer y' is unknown. [figure: the example net with "y' = ?" at each hidden node and a known y' at the output]
Solution: Back Propagation
Propagate the error backwards through the layers:
    δ_j = Σ_i δ_i w_j,i
[figure: node j collecting error from the next layer, e.g. δ = -0.9 through w = 0.1, δ = 0.2 through w = 1, δ = 0.4 through w = -0.3]
Also consider the gradient of the non-linear function:
    d tanh(φ(x)·w) / d(φ(x)·w) = 1 - tanh(φ(x)·w)^2 = 1 - y_j^2
[figure: plot of tanh and its derivative]
Together:
    δ'_j = (1 - y_j^2) Σ_i δ_i w_j,i
Back Propagation (on the example net)
Error of the output:
    δ2 = np.array( [y' - y] )
Error of the first layer:
    δ'2 = δ2 * (1 - φ2**2)
    δ1 = np.dot(δ'2, w1)
Error of the 0th layer:
    δ'1 = δ1 * (1 - φ1**2)
    δ0 = np.dot(δ'1, w0)
Back Propagation Code
backward_nn(net, φ, y')
    J = len(net)
    create array δ = [ 0, 0, ..., np.array([y' - φ[J][0]]) ]   # length J+1
    create array δ' = [ 0, 0, ..., 0 ]                          # length J+1
    for i in J-1 .. 0:
        δ'[i+1] = δ[i+1] * (1 - φ[i+1]**2)
        w, b = net[i]
        δ[i] = np.dot(δ'[i+1], w)
    return δ'
Updating Weights
Finally, use the error to update the weights.
The gradient of weight w_i is the outer product of the next layer's δ' and the previous layer's φ:
    -derr/dw_i = np.outer( δ'[i+1], φ[i] )
Multiply by the learning rate and update the weights:
    w_i += λ * -derr/dw_i
For the bias, the input is 1, so the gradient is simply δ':
    -derr/db_i = δ'[i+1]
    b_i += λ * -derr/db_i
Weight Update Code
update_weights(net, φ, δ', λ)
    for i in 0 .. len(net)-1:
        w, b = net[i]
        w += λ * np.outer( δ'[i+1], φ[i] )
        b += λ * δ'[i+1]
Overall View of Learning
# Create features, initialize weights randomly
create map ids, array feat_lab
for each labeled pair x, y in the data
    add (create_features(x), y) to feat_lab
initialize net randomly
# Perform training
for I iterations
    for each labeled pair φ0, y in feat_lab
        φ = forward_nn(net, φ0)
        δ' = backward_nn(net, φ, y)
        update_weights(net, φ, δ', λ)
print net to weight_file
print ids to id_file
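Pulling the pieces together, the following self-contained sketch recasts the pseudocode above as runnable numpy and trains the two-layer net on the XOR data from the earlier example (layer sizes, initialization, learning rate, and iteration count are my choices for illustration, not the lecture's):

    import numpy as np

    def forward_nn(net, phi0):
        phi = [phi0]
        for w, b in net:
            phi.append(np.tanh(np.dot(w, phi[-1]) + b))
        return phi

    def backward_nn(net, phi, label):
        J = len(net)
        delta = [np.zeros(len(p)) for p in phi]
        delta[J] = np.array([label - phi[J][0]])   # error at the output
        delta_p = [None] * (J + 1)
        for i in range(J - 1, -1, -1):
            delta_p[i + 1] = delta[i + 1] * (1 - phi[i + 1] ** 2)  # tanh gradient
            w, b = net[i]
            delta[i] = np.dot(delta_p[i + 1], w)   # propagate backwards
        return delta_p

    def update_weights(net, phi, delta_p, lam):
        for i, (w, b) in enumerate(net):
            w += lam * np.outer(delta_p[i + 1], phi[i])
            b += lam * delta_p[i + 1]

    rng = np.random.default_rng(0)
    net = [(rng.normal(0, 0.5, (2, 2)), rng.normal(0, 0.5, 2)),
           (rng.normal(0, 0.5, (1, 2)), rng.normal(0, 0.5, 1))]
    data = [([-1, 1], -1), ([1, 1], 1), ([-1, -1], 1), ([1, -1], -1)]  # XOR

    for _ in range(1000):
        for x, y in data:
            phi = forward_nn(net, np.array(x, dtype=float))
            update_weights(net, phi, backward_nn(net, phi, y), 0.1)

    for x, y in data:
        # outputs should approach the correct signs (convergence can depend on the random init)
        print(x, y, forward_nn(net, np.array(x, dtype=float))[-1][0])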
Outline Perceptron: recap and limitations Neural networks Multi-layer perceptron Forward propagation for prediction Back propagation for learning weights Multiclass prediction
Review: Prediction Problems
Given x, predict y:
- A book review ("Oh, man I love this book!" / "This book is so boring...") → Is it positive? (yes / no): Binary Prediction (2 choices)
- A tweet ("On the way to the park!" / "公園に行くなう!" [Japanese: "On my way to the park now!"]) → Its language (English / Japanese): Multi-class Prediction (several choices)
- A sentence ("I read a book") → Its syntactic parse: Structured Prediction (millions of choices) [figure: parse tree of "I read a book" with nodes S, NP, VP, VBD, DET, NN]
Multiclass classification with neural networks: how do we get from a single network output to a choice among several classes? [figure: the unigram feature inputs feeding a network whose output is marked "?"]
Sigmoid Function
The sigmoid softens the step function:
    P(y = 1 | x) = e^(w·φ(x)) / (1 + e^(w·φ(x)))
[figures: p(y|x) plotted against w·φ(x) for the step function and for the sigmoid function]
Softmax Function for multiclass classification
The sigmoid function generalized to multiple classes:
    P(y | x) = e^(w·φ(x,y)) / Σ_ỹ e^(w·φ(x,ỹ))
The numerator scores the current class; the denominator sums over all classes ỹ.
Can be expressed using matrix/vector operations:
    r = np.exp( np.dot(W, φ(x)) )
    p = r / np.sum(r)
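A minimal numpy sketch of the softmax (the max-subtraction is a standard numerical-stability trick, an addition of mine, not from the slide):

    import numpy as np

    def softmax(scores):
        # Turn a vector of class scores W·φ(x) into probabilities
        r = np.exp(scores - np.max(scores))  # shift for numerical stability
        return r / np.sum(r)

    print(softmax(np.array([2.0, 1.0, -1.0])))  # sums to 1; the largest score gets the highest probability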
What we've learned today Perceptron: recap and limitations Neural networks Multi-layer perceptron Forward propagation for prediction (implemented as matrix operations) Back propagation for learning weights (gradient descent + chain rule) Multiclass prediction
Coming next: recurrent neural networks, applications to NLP. Readings: Bengio et al. 2005