Logistic Regression & Neural Networks CMSC 723 / LING 723 / INST 725 Marine Carpuat Slides credit: Graham Neubig, Jacob Eisenstein
Logistic Regression
Perceptron & Probabilities What if we want a probability p(y|x)? The perceptron only gives us a prediction ŷ. Let's illustrate this with binary classification. Illustrations: Graham Neubig
The logistic function A softer function than the perceptron's step function; can account for uncertainty; differentiable
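As a minimal sketch, assuming the standard binary form P(y = 1 | x) = e^{w·φ(x)} / (1 + e^{w·φ(x)}) used on the later slides, the logistic function can be computed as:

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps a raw score z = w . phi(x) to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# A large positive score gives a probability near 1, a large negative score near 0;
# unlike the perceptron's step function, this is smooth and differentiable everywhere.
print(sigmoid(np.array([-5.0, 0.0, 5.0])))  # ~[0.0067, 0.5, 0.9933]
```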
Logistic regression: how to train? Train based on the conditional likelihood: find parameters w that maximize the conditional likelihood of all answers y_i given examples x_i
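Written out as a worked equation (a sketch consistent with the slide's setup, with i indexing training examples):

```latex
\hat{w} = \arg\max_{w} \prod_{i} P(y_i \mid x_i; w)
        = \arg\max_{w} \sum_{i} \log P(y_i \mid x_i; w),
\qquad
P(y = 1 \mid x; w) = \frac{e^{w \cdot \phi(x)}}{1 + e^{w \cdot \phi(x)}}
```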
Stochastic gradient ascent (or descent) An online training algorithm for logistic regression and other probabilistic models: update the weights for every training example, moving in the direction given by the gradient, with the size of the update step scaled by the learning rate
Gradient of the logistic function
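The equation on this slide did not survive extraction; the standard derivative, matching the form used on the later "Gradient of the Sigmoid Function" slide, is:

```latex
\frac{d}{dw} P(y{=}1 \mid x)
  = \frac{d}{dw}\,\frac{e^{w \cdot \phi(x)}}{1 + e^{w \cdot \phi(x)}}
  = \phi(x)\,\frac{e^{w \cdot \phi(x)}}{\left(1 + e^{w \cdot \phi(x)}\right)^{2}}
  = \phi(x)\,\sigma\!\left(w \cdot \phi(x)\right)\bigl(1 - \sigma\!\left(w \cdot \phi(x)\right)\bigr)
```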
Example: Person/not-person classification problem Given an introductory sentence from Wikipedia, predict whether the article is about a person
Example: initial update
Example: second update
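The worked updates on these two slides did not survive extraction; here is a hypothetical re-creation in Python. The sentences, vocabulary, feature values, and learning rate are illustrative assumptions, not the slide's actual numbers; the update rule w += α · dP(y|x)/dw is the one given on the stochastic gradient slides.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical unigram features for two introductory Wikipedia sentences.
vocab = ["gonso", "was", "a", "sanron", "sect", "priest",
         "shichikuzan", "temple", "in", "kyoto"]
x1 = np.array([1, 1, 1, 1, 1, 1, 0, 0, 0, 0], dtype=float)  # a priest -> person (y = +1)
x2 = np.array([0, 0, 1, 0, 0, 0, 1, 1, 1, 1], dtype=float)  # a temple -> not a person (y = -1)

w = np.zeros(len(vocab))
alpha = 1.0  # learning rate (illustrative)

# Initial update on (x1, y = +1): dP(y=+1|x)/dw = phi(x) * p * (1 - p)
p = sigmoid(w @ x1)
w += alpha * x1 * p * (1.0 - p)

# Second update on (x2, y = -1): dP(y=-1|x)/dw = -phi(x) * p * (1 - p)
p = sigmoid(w @ x2)
w += alpha * (-x2) * p * (1.0 - p)

print(dict(zip(vocab, np.round(w, 3))))
```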
How to set the learning rate? Various strategies: decay over time, α = 1 / (C + t), where C is a parameter and t is the number of samples seen so far; or use a held-out set, increasing the learning rate when held-out likelihood increases
Multiclass version
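A sketch of the multiclass form (the same softmax that appears on the later Softmax slide):

```latex
P(y \mid x) = \frac{e^{w \cdot \phi(x, y)}}{\sum_{y'} e^{w \cdot \phi(x, y')}}
```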
Some models are better than others Consider these 2 examples: which of the 2 models below is better? Classifier 2 will probably generalize better, because it does not include irrelevant information => the smaller model is better
Regularization A penalty on adding extra weights. L2 regularization (‖w‖₂²): big penalty on large weights, small penalty on small weights. L1 regularization (‖w‖₁): uniform increase whether weights are large or small; will cause many weights to become exactly zero
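A minimal sketch of the two penalty terms added to the training objective; the regularization strength lam is an assumed hyperparameter:

```python
import numpy as np

def l2_penalty(w, lam):
    """L2 regularization: lam * ||w||_2^2 -- penalizes large weights much more than small ones."""
    return lam * np.sum(w ** 2)

def l1_penalty(w, lam):
    """L1 regularization: lam * ||w||_1 -- same marginal penalty whether a weight is large or
    small, which is what pushes many weights to exactly zero."""
    return lam * np.sum(np.abs(w))

w = np.array([0.01, -2.0, 0.0, 0.5])
print(l2_penalty(w, 0.1), l1_penalty(w, 0.1))
```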
L1 regularization in online learning
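The details of this slide did not survive extraction; one common way to apply L1 during online/SGD training (a sketch, not necessarily the exact variant on the slide) is to take the gradient step and then shrink each weight toward zero, clipping weights that would cross zero:

```python
import numpy as np

def l1_online_step(w, grad, alpha, lam):
    """One gradient-ascent step followed by an L1 shrinkage/clipping (soft-thresholding) step.

    Weights whose magnitude falls below alpha * lam are set exactly to zero,
    which is how L1 produces sparse models during online learning.
    """
    w = w + alpha * grad                  # ordinary gradient step
    shrink = alpha * lam
    return np.sign(w) * np.maximum(np.abs(w) - shrink, 0.0)

w = np.array([0.3, -0.02, 1.5])
grad = np.array([0.1, 0.0, -0.2])
print(l1_online_step(w, grad, alpha=0.1, lam=0.5))  # the tiny middle weight is clipped to 0
```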
What you should know
- The standard supervised learning set-up for text classification; the difference between train vs. test data
- How to evaluate
- 3 examples of supervised linear classifiers: Naïve Bayes, Perceptron, Logistic Regression
- Learning as optimization: what is the objective function being optimized?
- The difference between generative vs. discriminative classifiers
- Smoothing, regularization
- Overfitting, underfitting
Neural networks
Person/not-person classification problem Given an introductory sentence from Wikipedia, predict whether the article is about a person
Formalizing binary prediction
The Perceptron: a machine to calculate a weighted sum. Example feature values: φ("A") = 1, φ("site") = 1, φ("located") = 1, φ("Maizuru") = 1, φ(",") = 2, φ("in") = 1, φ("Kyoto") = 1, φ("priest") = 0, φ("black") = 0. Prediction: y = sign( Σ_i w_i φ_i(x) ) ∈ {-1, +1}
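A minimal sketch of the perceptron prediction for this example; the feature values are taken from the slide, but the weight values are illustrative assumptions (the slide's exact weights are not all recoverable):

```python
# Unigram features phi(x) from the slide's example sentence.
phi = {"A": 1, "site": 1, "located": 1, "Maizuru": 1, ",": 2,
       "in": 1, "Kyoto": 1, "priest": 0, "black": 0}

# Illustrative weight vector (assumed, not the slide's exact numbers).
w = {"A": 0, "site": -3, "located": 0, "Maizuru": 0, ",": 0,
     "in": 0, "Kyoto": 0, "priest": 2, "black": 0}

score = sum(w[f] * v for f, v in phi.items())  # weighted sum  w . phi(x)
y = 1 if score > 0 else -1                     # sign of the sum gives the predicted class
print(score, y)                                # -3, -1  (predicts "not a person")
```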
The Perceptron: Geometric interpretation (figure: positive (O) and negative (X) examples separated by a linear decision boundary)
Limitation of the perceptron: it can only find linear separations between positive and negative examples (figure: an XOR-like arrangement of X and O points that cannot be separated by a single line)
Neural Networks Connect together multiple perceptrons (figure: the same feature vector feeding into several units). Motivation: can represent non-linear functions!
Neural Networks: key terms Input (aka features), output, nodes, layers, hidden layers, activation function (non-linear), multi-layer perceptron
Example Create two classifiers. The four points are φ0(x1) = {-1, 1} (X), φ0(x2) = {1, 1} (O), φ0(x3) = {-1, -1} (O), φ0(x4) = {1, -1} (X). Classifier 1: φ1[0] = sign(w0,0 · φ0(x) + b0,0) with w0,0 = {1, 1}, b0,0 = -1. Classifier 2: φ1[1] = sign(w0,1 · φ0(x) + b0,1) with w0,1 = {-1, -1}, b0,1 = -1
Example These classifiers map the examples to a new space. Original space: φ0(x1) = {-1, 1} (X), φ0(x2) = {1, 1} (O), φ0(x3) = {-1, -1} (O), φ0(x4) = {1, -1} (X). New space: φ1(x1) = {-1, -1} (X), φ1(x2) = {1, -1} (O), φ1(x3) = {-1, 1} (O), φ1(x4) = {-1, -1} (X)
Example In the new space, the examples are linearly separable! A single classifier in the new space gives the final prediction: y = φ2[0] = sign(w1,0 · φ1(x) + b1,0) with w1,0 = {1, 1}, b1,0 = 1
Example wrap-up: Forward propagation The final net (using tanh instead of sign): φ1[0] = tanh(1·φ0[0] + 1·φ0[1] - 1), φ1[1] = tanh(-1·φ0[0] - 1·φ0[1] - 1), φ2[0] = tanh(1·φ1[0] + 1·φ1[1] + 1)
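A sketch of forward propagation for this small net, with the weights and biases reconstructed from the example slides (treat them as assumptions):

```python
import numpy as np

# Layer 1: two hidden units; layer 2: one output unit (values reconstructed from the example).
W1 = np.array([[ 1.0,  1.0],     # weights into phi_1[0]
               [-1.0, -1.0]])    # weights into phi_1[1]
b1 = np.array([-1.0, -1.0])
W2 = np.array([[1.0, 1.0]])      # weights into phi_2[0]
b2 = np.array([1.0])

def forward(phi0):
    """Forward propagation: affine transform + tanh non-linearity at each layer."""
    phi1 = np.tanh(W1 @ phi0 + b1)
    phi2 = np.tanh(W2 @ phi1 + b2)
    return phi2[0]

# The four XOR-style points from the example; positive output = class O, negative = class X.
for x in ([-1, 1], [1, 1], [-1, -1], [1, -1]):
    print(x, round(forward(np.array(x, dtype=float)), 3))
```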
Softmax Function for multiclass classification: the sigmoid function generalized to multiple classes. P(y|x) = e^{w·φ(x,y)} / Σ_{y'} e^{w·φ(x,y')} (numerator: current class; denominator: sum over all classes). Can be expressed using matrix/vector ops: r = exp(W·φ(x)), p = r / Σ_j r_j
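A sketch of the matrix/vector form, assuming W holds one row of weights per class:

```python
import numpy as np

def softmax_probs(W, phi):
    """P(y | x) for all classes at once: exponentiate the score vector, then normalize.
    (In practice one would subtract max(W @ phi) before exp for numerical stability.)"""
    r = np.exp(W @ phi)          # r[y] = exp(w_y . phi(x))
    return r / np.sum(r)         # divide by the sum over all classes

W = np.array([[ 1.0, 0.0, 2.0],   # class 0 weights (illustrative)
              [ 0.0, 1.0, 0.0],   # class 1 weights
              [-1.0, 0.5, 0.0]])  # class 2 weights
phi = np.array([1.0, 2.0, 0.0])
print(softmax_probs(W, phi))      # probabilities sum to 1
```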
Stochastic Gradient Descent Online training algorithm for probabilistic models:
w = 0
for I iterations:
    for each labeled pair (x, y) in the data:
        w += α · dP(y|x)/dw
In other words: for every training example, calculate the gradient (the direction that will increase the probability of y) and move in that direction, multiplied by the learning rate α
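A runnable sketch of this loop for binary logistic regression, using the gradient of P(y|x) from the next slide; the dataset and hyperparameters are illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgd_train(data, n_features, iterations=10, alpha=0.1):
    """Stochastic gradient ascent on P(y | x) with labels y in {-1, +1}."""
    w = np.zeros(n_features)
    for _ in range(iterations):
        for phi, y in data:
            p = sigmoid(w @ phi)                  # P(y = +1 | x)
            grad = y * phi * p * (1.0 - p)        # dP(y | x)/dw for y = +1 or y = -1
            w += alpha * grad                     # move in the direction that raises P(y | x)
    return w

# Toy AND-like dataset (illustrative); last feature is a constant bias term.
data = [(np.array([1.0, 1.0, 1.0]), +1),
        (np.array([1.0, 0.0, 1.0]), -1),
        (np.array([0.0, 1.0, 1.0]), -1),
        (np.array([0.0, 0.0, 1.0]), -1)]
w = sgd_train(data, n_features=3, iterations=100, alpha=0.5)
print(w, [int(np.sign(w @ phi)) for phi, _ in data])  # learned weights and training predictions
```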
Gradient of the Sigmoid Function Take the derivative of the probability:
d/dw P(y=1|x) = d/dw [ e^{w·φ(x)} / (1 + e^{w·φ(x)}) ] = φ(x) · e^{w·φ(x)} / (1 + e^{w·φ(x)})²
d/dw P(y=-1|x) = d/dw [ 1 - e^{w·φ(x)} / (1 + e^{w·φ(x)}) ] = -φ(x) · e^{w·φ(x)} / (1 + e^{w·φ(x)})²
Learning: We Don't Know the Derivative for Hidden Units! For NNs, we only know the correct tag for the last layer. For the output weights: dP(y=1|x)/dw4 = h(x) · e^{w4·h(x)} / (1 + e^{w4·h(x)})². But for the hidden-layer weights: dP(y=1|x)/dw1 = ?, dP(y=1|x)/dw2 = ?, dP(y=1|x)/dw3 = ?
Answer: Back-Propagation Calculate the derivative with the chain rule:
dP(y=1|x)/dw1 = dP(y=1|x)/d(w4·h(x)) · d(w4·h(x))/dh1(x) · dh1(x)/dw1
where the three factors are the error of the next unit (δ4 = e^{w4·h(x)} / (1 + e^{w4·h(x)})²), the weight (w4,1), and the gradient of this unit.
In general, calculate the gradient for unit i based on the next units j:
dP(y=1|x)/dwi = dhi(x)/dwi · Σ_j δ_j w_{i,j}
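A sketch of one back-propagation step for a tiny tanh-then-sigmoid network of the kind shown on these slides, with the chain rule written out by hand. The specific weights, input, and learning rate are assumptions for illustration, not values from the slides:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Tiny network: phi(x) -> two tanh hidden units (w1, w2) -> sigmoid output (w4).
# Each vector has a trailing bias feature fixed at 1. All values are illustrative.
phi = np.array([1.0, -1.0, 1.0])
w1 = np.array([0.5, 0.5, -0.5])
w2 = np.array([-0.5, -0.5, -0.5])
w4 = np.array([1.0, 1.0, 0.5])

# Forward propagation.
h1, h2 = np.tanh(w1 @ phi), np.tanh(w2 @ phi)
h = np.array([h1, h2, 1.0])            # hidden vector, plus bias
p = sigmoid(w4 @ h)                    # P(y = 1 | x)

# Back-propagation of dP(y=1|x)/dw via the chain rule.
delta4 = p * (1.0 - p)                 # error of the output unit: dp / d(w4 . h)
grad_w4 = delta4 * h                   # dP/dw4 = delta4 * h(x)

# For hidden unit i: dP/dwi = delta4 * w4[i] * dh_i/dwi, with dh_i/dwi = (1 - h_i^2) * phi(x).
grad_w1 = delta4 * w4[0] * (1.0 - h1 ** 2) * phi
grad_w2 = delta4 * w4[1] * (1.0 - h2 ** 2) * phi

alpha = 0.1                            # gradient ascent on P(y = 1 | x)
w4 += alpha * grad_w4
w1 += alpha * grad_w1
w2 += alpha * grad_w2
print(np.round(p, 3), np.round(grad_w1, 3))
```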
Backpropagation = Gradient descent + Chain rule
Feed-Forward Neural Nets All connections point forward, from the input φ(x) to the output y: the network is a directed acyclic graph (DAG)
Neural Networks Non-linear classification. Prediction: forward propagation (vector/matrix operations + non-linearities). Training: backpropagation + stochastic gradient descent. For more details, see CIML Chapter 7