Logistic Regression & Neural Networks CMSC 723 / LING 723 / INST 725 Marine Carpuat Slides credit: Graham Neubig, Jacob Eisenstein
Logistic Regression
Perceptron & Probabilities What if we want a probability p(y|x)? The perceptron only gives us a prediction ŷ. Let's illustrate this with binary classification. Illustrations: Graham Neubig
The logistic function A softer function than the perceptron's step function; can account for uncertainty; differentiable
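As a minimal sketch, assuming the standard binary form P(y = 1 | x) = e^{w·φ(x)} / (1 + e^{w·φ(x)}) used on the later slides, the logistic function can be computed as:

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps a raw score z = w . phi(x) to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# A large positive score gives a probability near 1, a large negative score near 0;
# unlike the perceptron's step function, this is smooth and differentiable everywhere.
print(sigmoid(np.array([-5.0, 0.0, 5.0])))  # ~[0.0067, 0.5, 0.9933]
```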
Logistic regression: how to train? Train based on the conditional likelihood: find parameters w that maximize the conditional likelihood of all answers y_i given examples x_i
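Written out as a worked equation (a sketch consistent with the slide's setup, with i indexing training examples):

```latex
\hat{w} = \arg\max_{w} \prod_{i} P(y_i \mid x_i; w)
        = \arg\max_{w} \sum_{i} \log P(y_i \mid x_i; w),
\qquad
P(y = 1 \mid x; w) = \frac{e^{w \cdot \phi(x)}}{1 + e^{w \cdot \phi(x)}}
```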
Stochastic gradient ascent (or descent) An online training algorithm for logistic regression and other probabilistic models: update the weights for every training example, moving in the direction given by the gradient, with the size of the update step scaled by the learning rate
Gradient of the logistic function
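The equation on this slide did not survive extraction; the standard derivative, matching the form used on the later "Gradient of the Sigmoid Function" slide, is:

```latex
\frac{d}{dw} P(y{=}1 \mid x)
  = \frac{d}{dw}\,\frac{e^{w \cdot \phi(x)}}{1 + e^{w \cdot \phi(x)}}
  = \phi(x)\,\frac{e^{w \cdot \phi(x)}}{\left(1 + e^{w \cdot \phi(x)}\right)^{2}}
  = \phi(x)\,\sigma\!\left(w \cdot \phi(x)\right)\bigl(1 - \sigma\!\left(w \cdot \phi(x)\right)\bigr)
```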
Example: Person/not-person classification problem Given an introductory sentence from Wikipedia, predict whether the article is about a person
Example: initial update
Example: second update
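The worked updates on these two slides did not survive extraction; here is a hypothetical re-creation in Python. The sentences, vocabulary, feature values, and learning rate are illustrative assumptions, not the slide's actual numbers; the update rule w += α · dP(y|x)/dw is the one given on the stochastic gradient slides.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical unigram features for two introductory Wikipedia sentences.
vocab = ["gonso", "was", "a", "sanron", "sect", "priest",
         "shichikuzan", "temple", "in", "kyoto"]
x1 = np.array([1, 1, 1, 1, 1, 1, 0, 0, 0, 0], dtype=float)  # a priest -> person (y = +1)
x2 = np.array([0, 0, 1, 0, 0, 0, 1, 1, 1, 1], dtype=float)  # a temple -> not a person (y = -1)

w = np.zeros(len(vocab))
alpha = 1.0  # learning rate (illustrative)

# Initial update on (x1, y = +1): dP(y=+1|x)/dw = phi(x) * p * (1 - p)
p = sigmoid(w @ x1)
w += alpha * x1 * p * (1.0 - p)

# Second update on (x2, y = -1): dP(y=-1|x)/dw = -phi(x) * p * (1 - p)
p = sigmoid(w @ x2)
w += alpha * (-x2) * p * (1.0 - p)

print(dict(zip(vocab, np.round(w, 3))))
```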
How to set the learning rate? Various strategies: decay over time, α = 1 / (C + t), where C is a parameter and t is the number of samples seen so far; or use a held-out set, increasing the learning rate when held-out likelihood increases
Multiclass version
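A sketch of the multiclass form (the same softmax that appears on the later Softmax slide):

```latex
P(y \mid x) = \frac{e^{w \cdot \phi(x, y)}}{\sum_{y'} e^{w \cdot \phi(x, y')}}
```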
Some models are better than others Consider these 2 examples: which of the 2 models below is better? Classifier 2 will probably generalize better, because it does not include irrelevant information => the smaller model is better
Regularization A penalty on adding extra weights. L2 regularization (‖w‖₂²): big penalty on large weights, small penalty on small weights. L1 regularization (‖w‖₁): uniform increase whether weights are large or small; will cause many weights to become exactly zero
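A minimal sketch of the two penalty terms added to the training objective; the regularization strength lam is an assumed hyperparameter:

```python
import numpy as np

def l2_penalty(w, lam):
    """L2 regularization: lam * ||w||_2^2 -- penalizes large weights much more than small ones."""
    return lam * np.sum(w ** 2)

def l1_penalty(w, lam):
    """L1 regularization: lam * ||w||_1 -- same marginal penalty whether a weight is large or
    small, which is what pushes many weights to exactly zero."""
    return lam * np.sum(np.abs(w))

w = np.array([0.01, -2.0, 0.0, 0.5])
print(l2_penalty(w, 0.1), l1_penalty(w, 0.1))
```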
L1 regularization in online learning
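The details of this slide did not survive extraction; one common way to apply L1 during online/SGD training (a sketch, not necessarily the exact variant on the slide) is to take the gradient step and then shrink each weight toward zero, clipping weights that would cross zero:

```python
import numpy as np

def l1_online_step(w, grad, alpha, lam):
    """One gradient-ascent step followed by an L1 shrinkage/clipping (soft-thresholding) step.

    Weights whose magnitude falls below alpha * lam are set exactly to zero,
    which is how L1 produces sparse models during online learning.
    """
    w = w + alpha * grad                  # ordinary gradient step
    shrink = alpha * lam
    return np.sign(w) * np.maximum(np.abs(w) - shrink, 0.0)

w = np.array([0.3, -0.02, 1.5])
grad = np.array([0.1, 0.0, -0.2])
print(l1_online_step(w, grad, alpha=0.1, lam=0.5))  # the tiny middle weight is clipped to 0
```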
What you should know
- The standard supervised learning set-up for text classification; the difference between train vs. test data
- How to evaluate
- 3 examples of supervised linear classifiers: Naïve Bayes, Perceptron, Logistic Regression
- Learning as optimization: what is the objective function being optimized?
- The difference between generative vs. discriminative classifiers
- Smoothing, regularization
- Overfitting, underfitting
Neural networks
Person/not-person classification problem Given an introductory sentence from Wikipedia, predict whether the article is about a person
Formalizing binary prediction
The Perceptron: a machine to calculate a weighted sum. Example feature values: φ("A") = 1, φ("site") = 1, φ("located") = 1, φ("Maizuru") = 1, φ(",") = 2, φ("in") = 1, φ("Kyoto") = 1, φ("priest") = 0, φ("black") = 0. Prediction: y = sign( Σ_i w_i φ_i(x) ) ∈ {-1, +1}
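A minimal sketch of the perceptron prediction for this example; the feature values are taken from the slide, but the weight values are illustrative assumptions (the slide's exact weights are not all recoverable):

```python
# Unigram features phi(x) from the slide's example sentence.
phi = {"A": 1, "site": 1, "located": 1, "Maizuru": 1, ",": 2,
       "in": 1, "Kyoto": 1, "priest": 0, "black": 0}

# Illustrative weight vector (assumed, not the slide's exact numbers).
w = {"A": 0, "site": -3, "located": 0, "Maizuru": 0, ",": 0,
     "in": 0, "Kyoto": 0, "priest": 2, "black": 0}

score = sum(w[f] * v for f, v in phi.items())  # weighted sum  w . phi(x)
y = 1 if score > 0 else -1                     # sign of the sum gives the predicted class
print(score, y)                                # -3, -1  (predicts "not a person")
```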
The Perceptron: Geometric interpretation (figure: positive (O) and negative (X) examples separated by a linear decision boundary)
Limitation of the perceptron: it can only find linear separations between positive and negative examples (figure: an XOR-like arrangement of X and O points that cannot be separated by a single line)
Neural Networks Connect together multiple perceptrons (figure: the same feature vector feeding into several units). Motivation: can represent non-linear functions!
Neural Networks: key terms Input (aka features), output, nodes, layers, hidden layers, activation function (non-linear), multi-layer perceptron
Example Create two classifiers. The four points are φ0(x1) = {-1, 1} (X), φ0(x2) = {1, 1} (O), φ0(x3) = {-1, -1} (O), φ0(x4) = {1, -1} (X). Classifier 1: φ1[0] = sign(w0,0 · φ0(x) + b0,0) with w0,0 = {1, 1}, b0,0 = -1. Classifier 2: φ1[1] = sign(w0,1 · φ0(x) + b0,1) with w0,1 = {-1, -1}, b0,1 = -1
Example These classifiers map the examples to a new space. Original space: φ0(x1) = {-1, 1} (X), φ0(x2) = {1, 1} (O), φ0(x3) = {-1, -1} (O), φ0(x4) = {1, -1} (X). New space: φ1(x1) = {-1, -1} (X), φ1(x2) = {1, -1} (O), φ1(x3) = {-1, 1} (O), φ1(x4) = {-1, -1} (X)
Example In the new space, the examples are linearly separable! A single classifier in the new space gives the final prediction: y = φ2[0] = sign(w1,0 · φ1(x) + b1,0) with w1,0 = {1, 1}, b1,0 = 1
Example wrap-up: Forward propagation The final net (using tanh instead of sign): φ1[0] = tanh(1·φ0[0] + 1·φ0[1] - 1), φ1[1] = tanh(-1·φ0[0] - 1·φ0[1] - 1), φ2[0] = tanh(1·φ1[0] + 1·φ1[1] + 1)
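A sketch of forward propagation for this small net, with the weights and biases reconstructed from the example slides (treat them as assumptions):

```python
import numpy as np

# Layer 1: two hidden units; layer 2: one output unit (values reconstructed from the example).
W1 = np.array([[ 1.0,  1.0],     # weights into phi_1[0]
               [-1.0, -1.0]])    # weights into phi_1[1]
b1 = np.array([-1.0, -1.0])
W2 = np.array([[1.0, 1.0]])      # weights into phi_2[0]
b2 = np.array([1.0])

def forward(phi0):
    """Forward propagation: affine transform + tanh non-linearity at each layer."""
    phi1 = np.tanh(W1 @ phi0 + b1)
    phi2 = np.tanh(W2 @ phi1 + b2)
    return phi2[0]

# The four XOR-style points from the example; positive output = class O, negative = class X.
for x in ([-1, 1], [1, 1], [-1, -1], [1, -1]):
    print(x, round(forward(np.array(x, dtype=float)), 3))
```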
Softmax Function for multiclass classification: the sigmoid function generalized to multiple classes. P(y|x) = e^{w·φ(x,y)} / Σ_{y'} e^{w·φ(x,y')} (numerator: current class; denominator: sum over all classes). Can be expressed using matrix/vector ops: r = exp(W·φ(x)), p = r / Σ_j r_j
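A sketch of the matrix/vector form, assuming W holds one row of weights per class:

```python
import numpy as np

def softmax_probs(W, phi):
    """P(y | x) for all classes at once: exponentiate the score vector, then normalize.
    (In practice one would subtract max(W @ phi) before exp for numerical stability.)"""
    r = np.exp(W @ phi)          # r[y] = exp(w_y . phi(x))
    return r / np.sum(r)         # divide by the sum over all classes

W = np.array([[ 1.0, 0.0, 2.0],   # class 0 weights (illustrative)
              [ 0.0, 1.0, 0.0],   # class 1 weights
              [-1.0, 0.5, 0.0]])  # class 2 weights
phi = np.array([1.0, 2.0, 0.0])
print(softmax_probs(W, phi))      # probabilities sum to 1
```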
Stochastic Gradient Descent Online training algorithm for probabilistic models:
w = 0
for I iterations:
    for each labeled pair (x, y) in the data:
        w += α · dP(y|x)/dw
In other words: for every training example, calculate the gradient (the direction that will increase the probability of y) and move in that direction, multiplied by the learning rate α
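A runnable sketch of this loop for binary logistic regression, using the gradient of P(y|x) from the next slide; the dataset and hyperparameters are illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgd_train(data, n_features, iterations=10, alpha=0.1):
    """Stochastic gradient ascent on P(y | x) with labels y in {-1, +1}."""
    w = np.zeros(n_features)
    for _ in range(iterations):
        for phi, y in data:
            p = sigmoid(w @ phi)                  # P(y = +1 | x)
            grad = y * phi * p * (1.0 - p)        # dP(y | x)/dw for y = +1 or y = -1
            w += alpha * grad                     # move in the direction that raises P(y | x)
    return w

# Toy AND-like dataset (illustrative); last feature is a constant bias term.
data = [(np.array([1.0, 1.0, 1.0]), +1),
        (np.array([1.0, 0.0, 1.0]), -1),
        (np.array([0.0, 1.0, 1.0]), -1),
        (np.array([0.0, 0.0, 1.0]), -1)]
w = sgd_train(data, n_features=3, iterations=100, alpha=0.5)
print(w, [int(np.sign(w @ phi)) for phi, _ in data])  # learned weights and training predictions
```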
Gradient of the Sigmoid Function Take the derivative of the probability:
d/dw P(y=1|x) = d/dw [ e^{w·φ(x)} / (1 + e^{w·φ(x)}) ] = φ(x) · e^{w·φ(x)} / (1 + e^{w·φ(x)})²
d/dw P(y=-1|x) = d/dw [ 1 - e^{w·φ(x)} / (1 + e^{w·φ(x)}) ] = -φ(x) · e^{w·φ(x)} / (1 + e^{w·φ(x)})²
Learning: We Don't Know the Derivative for Hidden Units! For NNs, we only know the correct tag for the last layer. For the output weights: dP(y=1|x)/dw4 = h(x) · e^{w4·h(x)} / (1 + e^{w4·h(x)})². But for the hidden-layer weights: dP(y=1|x)/dw1 = ?, dP(y=1|x)/dw2 = ?, dP(y=1|x)/dw3 = ?
Answer: Back-Propagation Calculate the derivative with the chain rule:
dP(y=1|x)/dw1 = dP(y=1|x)/d(w4·h(x)) · d(w4·h(x))/dh1(x) · dh1(x)/dw1
where the three factors are the error of the next unit (δ4 = e^{w4·h(x)} / (1 + e^{w4·h(x)})²), the weight (w4,1), and the gradient of this unit.
In general, calculate the gradient for unit i based on the next units j:
dP(y=1|x)/dwi = dhi(x)/dwi · Σ_j δ_j w_{i,j}
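A sketch of one back-propagation step for a tiny tanh-then-sigmoid network of the kind shown on these slides, with the chain rule written out by hand. The specific weights, input, and learning rate are assumptions for illustration, not values from the slides:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Tiny network: phi(x) -> two tanh hidden units (w1, w2) -> sigmoid output (w4).
# Each vector has a trailing bias feature fixed at 1. All values are illustrative.
phi = np.array([1.0, -1.0, 1.0])
w1 = np.array([0.5, 0.5, -0.5])
w2 = np.array([-0.5, -0.5, -0.5])
w4 = np.array([1.0, 1.0, 0.5])

# Forward propagation.
h1, h2 = np.tanh(w1 @ phi), np.tanh(w2 @ phi)
h = np.array([h1, h2, 1.0])            # hidden vector, plus bias
p = sigmoid(w4 @ h)                    # P(y = 1 | x)

# Back-propagation of dP(y=1|x)/dw via the chain rule.
delta4 = p * (1.0 - p)                 # error of the output unit: dp / d(w4 . h)
grad_w4 = delta4 * h                   # dP/dw4 = delta4 * h(x)

# For hidden unit i: dP/dwi = delta4 * w4[i] * dh_i/dwi, with dh_i/dwi = (1 - h_i^2) * phi(x).
grad_w1 = delta4 * w4[0] * (1.0 - h1 ** 2) * phi
grad_w2 = delta4 * w4[1] * (1.0 - h2 ** 2) * phi

alpha = 0.1                            # gradient ascent on P(y = 1 | x)
w4 += alpha * grad_w4
w1 += alpha * grad_w1
w2 += alpha * grad_w2
print(np.round(p, 3), np.round(grad_w1, 3))
```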
Backpropagation = Gradient descent + Chain rule
Feed-Forward Neural Nets All connections point forward, from the input φ(x) to the output y: the network is a directed acyclic graph (DAG)
Neural Networks Non-linear classification. Prediction: forward propagation (vector/matrix operations + non-linearities). Training: backpropagation + stochastic gradient descent. For more details, see CIML Chapter 7