Logistic Regression & Neural Networks

1 Logistic Regression & Neural Networks CMSC 723 / LING 723 / INST 725 Marine Carpuat Slides credit: Graham Neubig, Jacob Eisenstein

2 Logistic Regression

3 Perceptron & Probabilities What if we want a probability p(y | x)? The perceptron gives us a prediction y. Let's illustrate this with binary classification. Illustrations: Graham Neubig

4 The logistic function $P(y = 1 \mid x) = \frac{e^{w \cdot \phi(x)}}{1 + e^{w \cdot \phi(x)}}$. Softer function than in the perceptron. Can account for uncertainty. Differentiable.
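
To make the contrast concrete, here is a minimal sketch comparing the perceptron's hard decision with the logistic function applied to the same raw score w · φ(x). The function names are illustrative and not taken from the slides.

```python
import math

def perceptron_predict(score):
    """Hard decision: +1 or -1 depending on the sign of the score."""
    return 1 if score >= 0 else -1

def logistic(score):
    """P(y = 1 | x) for a raw score w . phi(x), written to avoid overflow."""
    if score >= 0:
        return 1.0 / (1.0 + math.exp(-score))
    e = math.exp(score)
    return e / (1.0 + e)

# The perceptron flips abruptly at 0; the logistic function changes smoothly.
for s in (-5.0, -0.5, 0.0, 0.5, 5.0):
    print(f"score={s:+.1f}  perceptron={perceptron_predict(s):+d}  P(y=1|x)={logistic(s):.3f}")
```

Scores near zero give probabilities near 0.5, which is how the model expresses uncertainty.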

5 Logistic regression: how to train? Train based on conditional likelihood. Find parameters w that maximize the conditional likelihood of all answers $y_i$ given examples $x_i$: $\hat{w} = \operatorname{argmax}_w \prod_i P(y_i \mid x_i; w)$
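
As a rough illustration of the training objective, the sketch below evaluates the conditional log-likelihood (the log of the product, which has the same maximizer) for two weight settings. The toy data, the feature names, and the helper names are hypothetical; labels are assumed to be in {+1, -1}.

```python
import math

def prob_y_given_x(w, phi, y):
    """P(y | x) under binary logistic regression, with y in {+1, -1}."""
    score = sum(w.get(name, 0.0) * value for name, value in phi.items())
    p_pos = 1.0 / (1.0 + math.exp(-score))
    return p_pos if y == 1 else 1.0 - p_pos

def log_likelihood(w, data):
    """Sum of log P(y_i | x_i) over all labeled examples."""
    return sum(math.log(prob_y_given_x(w, phi, y)) for phi, y in data)

# Hypothetical toy data: sparse feature dicts with labels in {+1, -1}.
data = [({"priest": 1.0}, 1), ({"site": 1.0}, -1)]
zero_weights = {}
better_weights = {"priest": 2.0, "site": -2.0}
print(log_likelihood(zero_weights, data))
print(log_likelihood(better_weights, data))
```

Weights that assign higher probability to the observed labels yield a larger (less negative) log-likelihood.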

6 Stochastic gradient ascent (or descent) Online training algorithm for logistic regression and other probabilistic models. Update weights for every training example. Move in the direction given by the gradient. Size of the update step is scaled by the learning rate.
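
The online procedure described here can be sketched as follows. This is a minimal version under stated assumptions: labels are in {+1, -1}, features are sparse dicts, and the update uses the derivative of the logistic function, dP(y | x)/dw = y · φ(x) · p · (1 − p) with p = P(y = 1 | x). The toy data and the names `dot` and `sgd_train` are hypothetical.

```python
import math

def dot(w, phi):
    """Sparse dot product w . phi(x)."""
    return sum(w.get(name, 0.0) * value for name, value in phi.items())

def sgd_train(data, iterations=20, alpha=1.0):
    """Stochastic gradient ascent on P(y | x), one update per training example."""
    w = {}
    for _ in range(iterations):
        for phi, y in data:
            p = 1.0 / (1.0 + math.exp(-dot(w, phi)))  # P(y = 1 | x)
            scale = alpha * y * p * (1.0 - p)         # step size times gradient magnitude
            for name, value in phi.items():
                w[name] = w.get(name, 0.0) + scale * value
    return w

# Hypothetical toy data: one "person" sentence and one "not person" sentence.
data = [({"priest": 1.0, "in": 1.0}, 1), ({"site": 1.0, "in": 1.0}, -1)]
w = sgd_train(data)
print(w)
```

A fixed learning rate is used only to keep the sketch short.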

7 Gradient of the logistic function

8 Example: Person/not-person classification problem. Given an introductory sentence in Wikipedia, predict whether the article is about a person.

9 Example: initial update

10 Example: second update

11 How to set the learning rate? Various strategies: decay over time, $\alpha = \frac{1}{C + t}$ (C: parameter, t: number of samples); use a held-out test set, and increase the learning rate when likelihood increases.
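
A small sketch of the decay schedule α = 1 / (C + t), where t counts the samples seen so far. The function name and the default C = 1.0 are assumptions made for illustration.

```python
def learning_rate(t, C=1.0):
    """Decaying learning rate alpha = 1 / (C + t), with t the number of samples seen."""
    return 1.0 / (C + t)

# The step size shrinks as more samples are processed.
for t in (0, 1, 9, 99):
    print(f"t={t:3d}  alpha={learning_rate(t):.4f}")
```

A larger C makes the early steps smaller and the decay more gradual.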

12 Multiclass version

13 Some models are better than others. Consider these 2 examples. Which of the 2 models below is better? Classifier 2 will probably generalize better! It does not include irrelevant information => Smaller model is better

14 Regularization: A penalty on adding extra weights. L2 regularization, $\|w\|_2$: big penalty on large weights, small penalty on small weights. L1 regularization, $\|w\|_1$: uniform increase when large or small; will cause many weights to become zero.
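
The different behaviour of the two penalties can be seen in a single shrinkage step. This is a sketch under assumptions: the L2 step uses the common squared form of the penalty (λ · Σ w_i²), and the L1 step clips at zero so a weight never changes sign, which is one common way to apply it and not necessarily the one used in the slides. All names and values are illustrative.

```python
def l2_step(w, alpha, lam):
    """One gradient step on lam * sum(w_i^2): each weight shrinks in proportion to its size."""
    return {name: value - alpha * 2.0 * lam * value for name, value in w.items()}

def l1_step(w, alpha, lam):
    """One step on lam * sum(|w_i|): each weight moves toward zero by the same amount, clipped at zero."""
    shrink = alpha * lam
    out = {}
    for name, value in w.items():
        if value > shrink:
            out[name] = value - shrink
        elif value < -shrink:
            out[name] = value + shrink
        else:
            out[name] = 0.0
    return out

w = {"a": 2.0, "b": 0.05, "c": -0.03}
after_l2 = l2_step(w, alpha=0.1, lam=1.0)
after_l1 = l1_step(w, alpha=0.1, lam=1.0)
print("L2:", after_l2)
print("L1:", after_l1)
```

Under L2 the small weights get smaller but stay non-zero; under L1 they are set exactly to zero, which is why L1 tends to produce sparse models.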

15 L1 regularization in online learning

16 What you should know: Standard supervised learning set-up for text classification; difference between train vs. test data; how to evaluate. 3 examples of supervised linear classifiers: Naïve Bayes, Perceptron, Logistic Regression. Learning as optimization: what is the objective function optimized? Difference between generative vs. discriminative classifiers. Smoothing, regularization. Overfitting, underfitting.

17 Neural networks

18 Person/not-person classification problem. Given an introductory sentence in Wikipedia, predict whether the article is about a person.

19 Formalizing binary prediction

20 The Perceptron: a machine to calculate a weighted sum. Features: φ_"A" = 1, φ_"site" = 1, φ_"located" = 1, φ_"Maizuru" = 1, φ_"," = 2, φ_"in" = 1, φ_"Kyoto" = 1, φ_"priest" = 0, φ_"black" = 0. Output: $\operatorname{sign}\left(\sum_{i=1}^{I} w_i \cdot \phi_i(x)\right) = -1$
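
The weighted sum can be sketched directly from the feature values on this slide. The feature counts come from the slide; the weights are hypothetical, chosen only so that the prediction matches the slide's output of -1.

```python
def predict(w, phi):
    """Perceptron prediction: sign of the weighted sum of feature values."""
    score = sum(w.get(name, 0.0) * value for name, value in phi.items())
    return 1 if score >= 0 else -1

# Feature values as listed on the slide.
phi = {"A": 1, "site": 1, "located": 1, "Maizuru": 1, ",": 2,
       "in": 1, "Kyoto": 1, "priest": 0, "black": 0}

# Hypothetical weights (not given in the slides); unlisted features get weight 0.
w = {"site": -3.0, "located": -1.0, "priest": 2.0}

print(predict(w, phi))
```

Features with value 0, such as "priest" here, contribute nothing to the sum regardless of their weight.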

21 The Perceptron: Geometric interpretation [Figure: O and X examples plotted in feature space]

22 The Perceptron: Geometric interpretation [Figure: O and X examples plotted in feature space]

23 Limitation of perceptron: can only find linear separations between positive and negative examples. [Figure: examples arranged as X O / O X]

24 Neural Networks: Connect together multiple perceptrons. Features: φ_"A" = 1, φ_"site" = 1, φ_"located" = 1, φ_"Maizuru" = 1, φ_"," = 2, φ_"in" = 1, φ_"Kyoto" = 1, φ_"priest" = 0, φ_"black" = 0. Output: -1. Motivation: Can represent non-linear functions!

25 Neural Networks: key terms. Features: φ_"A" = 1, φ_"site" = 1, φ_"located" = 1, φ_"Maizuru" = 1, φ_"," = 2, φ_"in" = 1, φ_"Kyoto" = 1, φ_"priest" = 0, φ_"black" = 0. Output: -1. Terms: Input (aka features); Output; Nodes; Layers; Hidden layers; Activation function (non-linear); Multi-layer perceptron.

26 Example: Create two classifiers. Inputs: φ0(x1) = {-1, 1} (X), φ0(x2) = {1, 1} (O), φ0(x3) = {-1, -1} (O), φ0(x4) = {1, -1} (X). Classifier 1: φ1[0] = sign(w_{0,0} · φ0 + b_{0,0}). Classifier 2: φ1[1] = sign(w_{0,1} · φ0 + b_{0,1}).

27 Example: These classifiers map to a new space. φ0(x1) = {-1, 1} → φ1(x1) = {-1, -1} (X); φ0(x2) = {1, 1} → φ1(x2) = {1, -1} (O); φ0(x3) = {-1, -1} → φ1(x3) = {-1, 1} (O); φ0(x4) = {1, -1} → φ1(x4) = {-1, -1} (X).

28 Example: In the new space, the examples are linearly separable! φ1(x1) = φ1(x4) = {-1, -1} (X), φ1(x2) = {1, -1} (O), φ1(x3) = {-1, 1} (O). A third unit, φ2[0] = y, computed from φ1[0] and φ1[1], separates them.

29 Example wrap-up: Forward propagation. The final net: [Figure: inputs φ0[0], φ0[1] and a bias feed two tanh units, φ1[0] and φ1[1]; these and a bias feed a tanh output unit, φ2[0]]
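
Forward propagation through this small network can be sketched as below. The weights are not legible in the extracted slide, so the values here are assumptions: they were chosen to reproduce the hidden-layer values listed in the example (φ1(x2) = {1, -1}, φ1(x3) = {-1, 1}, φ1(x1) = φ1(x4) = {-1, -1}) once tanh is replaced by sign. O is treated as the positive class.

```python
import math

# Assumed weights and biases (reconstructed, not read off the slide).
W1 = [[1.0, 1.0], [-1.0, -1.0]]  # one row per hidden unit
B1 = [-1.0, -1.0]
W2 = [1.0, 1.0]                  # output unit weights
B2 = 1.0

def forward(phi0):
    """Compute the hidden layer phi1 and the output phi2[0] for a 2-dimensional input."""
    phi1 = [math.tanh(sum(w * x for w, x in zip(row, phi0)) + b)
            for row, b in zip(W1, B1)]
    phi2 = math.tanh(sum(w * h for w, h in zip(W2, phi1)) + B2)
    return phi1, phi2

examples = {"x1": [-1.0, 1.0], "x2": [1.0, 1.0], "x3": [-1.0, -1.0], "x4": [1.0, -1.0]}
for name, phi0 in examples.items():
    phi1, y = forward(phi0)
    print(name, [round(h, 3) for h in phi1], round(y, 3))
```

With these weights the output is positive for x2 and x3 and negative for x1 and x4, a split that no single linear classifier on φ0 can produce.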

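30 Softmax Function for multiclass classification. Sigmoid function for multiple classes: $P(y \mid x) = \frac{e^{w \cdot \phi(x, y)}}{\sum_{\tilde{y}} e^{w \cdot \phi(x, \tilde{y})}}$ (numerator: current class; denominator: sum over all candidate classes). Can be expressed using matrix/vector ops: $r = \exp(W \cdot \phi(x, y))$, $p = r / \sum_{\tilde{r} \in r} \tilde{r}$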
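
A minimal sketch of the softmax computation over a vector of per-class scores. Subtracting the maximum score before exponentiating is a standard numerical-stability step that is not mentioned in the slides; it does not change the result.

```python
import math

def softmax(scores):
    """Turn a list of per-class scores into probabilities that sum to 1."""
    m = max(scores)                         # shift for numerical stability
    r = [math.exp(s - m) for s in scores]   # r = exp(scores), up to a constant factor
    total = sum(r)
    return [value / total for value in r]   # p = r / sum(r)

probs = softmax([2.0, 1.0, -1.0])
print([round(p, 3) for p in probs])

# With two classes and scores [s, 0], softmax reduces to the sigmoid of s.
two_class = softmax([2.0, 0.0])
print(round(two_class[0], 6), round(1.0 / (1.0 + math.exp(-2.0)), 6))
```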

31 Stochastic Gradient Descent: Online training algorithm for probabilistic models.
    w = 0
    for I iterations
        for each labeled pair x, y in the data
            w += α * dP(y | x)/dw
In other words: for every training example, calculate the gradient (the direction that will increase the probability of y), then move in that direction, multiplied by learning rate α.

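32 Gradient of the Sigmoid Function. Take the derivative of the probability: $\frac{d}{dw} P(y = 1 \mid x) = \frac{d}{dw} \frac{e^{w \cdot \phi(x)}}{1 + e^{w \cdot \phi(x)}} = \phi(x) \frac{e^{w \cdot \phi(x)}}{(1 + e^{w \cdot \phi(x)})^2}$ and $\frac{d}{dw} P(y = -1 \mid x) = \frac{d}{dw} \left(1 - \frac{e^{w \cdot \phi(x)}}{1 + e^{w \cdot \phi(x)}}\right) = -\phi(x) \frac{e^{w \cdot \phi(x)}}{(1 + e^{w \cdot \phi(x)})^2}$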
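
One way to sanity-check the derivative is to compare it with a finite-difference estimate. The sketch below uses the identity e^s / (1 + e^s)^2 = p · (1 − p), where s = w · φ(x) and p = P(y = 1 | x); the weights and features are arbitrary illustrative values.

```python
import math

def p_pos(w, phi):
    """P(y = 1 | x) for dense weight and feature vectors."""
    score = sum(wi * xi for wi, xi in zip(w, phi))
    return 1.0 / (1.0 + math.exp(-score))

def grad_p_pos(w, phi):
    """Analytic gradient phi(x) * e^s / (1 + e^s)^2, computed as phi(x) * p * (1 - p)."""
    p = p_pos(w, phi)
    return [xi * p * (1.0 - p) for xi in phi]

def numeric_grad(f, w, phi, eps=1e-6):
    """Central finite-difference estimate of df/dw, one coordinate at a time."""
    grads = []
    for i in range(len(w)):
        up = list(w)
        up[i] += eps
        down = list(w)
        down[i] -= eps
        grads.append((f(up, phi) - f(down, phi)) / (2 * eps))
    return grads

w = [0.4, -1.2, 0.7]
phi = [1.0, 2.0, -0.5]
analytic = grad_p_pos(w, phi)
numeric = numeric_grad(p_pos, w, phi)
print(analytic)
print(numeric)
```

The gradient for P(y = -1 | x) is the same vector with the sign flipped, since the two probabilities sum to 1.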

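33 Learning: We Don't Know the Derivative for Hidden Units! For NNs, we only know the correct tag for the last layer. [Figure: φ(x) feeds the hidden units h(x) through weights w_1, w_2, w_3; h(x) feeds the output y = 1 through w_4] $\frac{dP(y = 1 \mid x)}{dw_4} = h(x) \frac{e^{w_4 \cdot h(x)}}{(1 + e^{w_4 \cdot h(x)})^2}$, but $\frac{dP(y = 1 \mid x)}{dw_1} = ?$, $\frac{dP(y = 1 \mid x)}{dw_2} = ?$, $\frac{dP(y = 1 \mid x)}{dw_3} = ?$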

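34 Answer: Back-Propagation. Calculate the derivative with the chain rule: $\frac{dP(y = 1 \mid x)}{dw_1} = \frac{dP(y = 1 \mid x)}{d(w_4 \cdot h(x))} \cdot \frac{d(w_4 \cdot h(x))}{dh_1(x)} \cdot \frac{dh_1(x)}{dw_1}$, where the first factor, $\frac{e^{w_4 \cdot h(x)}}{(1 + e^{w_4 \cdot h(x)})^2}$, is the error of the next unit ($\delta_4$), the second factor, $w_{1,4}$, is the weight, and the third factor is the gradient of this unit. In general, calculate i based on next units j: $\frac{dP(y = 1 \mid x)}{dw_i} = \frac{dh_i(x)}{dw_i} \sum_j \delta_j w_{i,j}$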

35 Backpropagation = Gradient descent + Chain rule
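
The chain-rule computation can be sketched for a tiny network with two tanh hidden units and a sigmoid output P(y = 1 | x). The architecture, parameter values, and function names are assumptions made for illustration; the backward pass is checked against a finite-difference estimate.

```python
import math

def forward(params, x):
    """Return hidden activations h and output p = P(y = 1 | x)."""
    W, b, v, c = params
    h = [math.tanh(sum(wji * xi for wji, xi in zip(row, x)) + bj)
         for row, bj in zip(W, b)]
    p = 1.0 / (1.0 + math.exp(-(sum(vj * hj for vj, hj in zip(v, h)) + c)))
    return h, p

def backward(params, x):
    """Gradients of p with respect to every parameter, via the chain rule."""
    W, b, v, c = params
    h, p = forward(params, x)
    delta_out = p * (1.0 - p)                    # error of the output unit
    grad_v = [delta_out * hj for hj in h]
    grad_c = delta_out
    # Error of each hidden unit: next unit's error * connecting weight * local tanh gradient.
    delta_h = [delta_out * vj * (1.0 - hj * hj) for vj, hj in zip(v, h)]
    grad_W = [[dj * xi for xi in x] for dj in delta_h]
    grad_b = list(delta_h)
    return grad_W, grad_b, grad_v, grad_c

def numeric_grad_W(params, x, j, i, eps=1e-6):
    """Finite-difference estimate of dp/dW[j][i]."""
    W, b, v, c = params
    up = [list(row) for row in W]
    up[j][i] += eps
    down = [list(row) for row in W]
    down[j][i] -= eps
    return (forward((up, b, v, c), x)[1] - forward((down, b, v, c), x)[1]) / (2 * eps)

params = ([[0.5, -0.3], [0.1, 0.8]], [0.0, -0.2], [1.2, -0.7], 0.1)
x = [1.0, -1.0]
grad_W, grad_b, grad_v, grad_c = backward(params, x)
print(grad_W[0][1], numeric_grad_W(params, x, 0, 1))
```

Each hidden unit's gradient reuses the output unit's error term, so the errors only need to be computed once per layer.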

36 Feed Forward Neural Nets: All connections point forward, from φ(x) to y. It is a directed acyclic graph (DAG).

37 Neural Networks: Non-linear classification. Prediction: forward propagation (vector/matrix operations + non-linearities). Training: backpropagation + stochastic gradient descent. For more details, see CIML Chap 7.
