TTIC 31230, Fundamentals of Deep Learning David McAllester, Winter Multiclass Logistic Regression. Multilayer Perceptrons (MLPs)


TTIC 31230, Fundamentals of Deep Learning
David McAllester, Winter 2018

Multiclass Logistic Regression
Multilayer Perceptrons (MLPs)
Stochastic Gradient Descent (SGD)

Multiclass Classification

We consider the problem of taking an input $x$ (such as an image of a handwritten digit) and classifying it into one of a small number of classes (such as the digits 0 through 9).

Multiclass Classification

Assume a population distribution on pairs $(x, y)$ with $x \in \mathbb{R}^d$ and $y \in C$.

For MNIST, $x$ is a $28 \times 28$ image, which we take to be a 784-dimensional vector, giving $x \in \mathbb{R}^{784}$.

For MNIST, $C$ is the set $\{0, \ldots, 9\}$.

Assume a sample $(x_0, y_0), \ldots, (x_{N-1}, y_{N-1})$ drawn IID from the population.

We want to use the sample to construct a rule for predicting $y$ given $x$ when we draw new pairs from the population.

Multiclass Logistic Regression

Assume a sample $(x_0, y_0), \ldots, (x_{N-1}, y_{N-1})$ drawn IID from the population with $x \in \mathbb{R}^d$ and $y \in \{0, \ldots, K\}$.

For a new $x$ we compute a score $s(\hat{y})$ for each possible label $\hat{y}$.

$$s = W x + b$$

Multiclass Logistic Regression for MNIST

$j$: image pixel
$\hat{y}$: possible image label (0 through 9)

$$s(\hat{y}) = \sum_j W[\hat{y}, j]\, x[j] + b[\hat{y}]$$

Note that $W[\hat{y}, :]$ is an image.
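The score formula can be sketched in plain Python. This is a minimal illustration, not code from the lecture: the function name `scores` and the toy sizes (3 labels, 4 pixels instead of 10 labels and 784 pixels) are assumptions made here for readability.

```python
def scores(W, b, x):
    """Return s with s[y_hat] = sum_j W[y_hat][j] * x[j] + b[y_hat]."""
    return [sum(w_j * x_j for w_j, x_j in zip(row, x)) + b_y
            for row, b_y in zip(W, b)]

# Toy problem (assumed sizes): 3 labels, 4 "pixels".
W = [[1.0, 0.0, 0.0, 0.0],
     [0.0, 1.0, 1.0, 0.0],
     [0.5, 0.5, 0.5, 0.5]]
b = [0.0, -1.0, 0.25]
x = [1.0, 2.0, 3.0, 4.0]

s = scores(W, b, x)
print(s)  # one score per label: [1.0, 4.0, 5.25]
```

Each row `W[y_hat]` has the same length as `x`, which is the sense in which a row of the weight matrix can be viewed as an image.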

Softmax

Softmax converts scores (or energies or logits) to probabilities.

$$Q(\hat{y}) = \frac{1}{Z}\, e^{s(\hat{y})} \qquad Z = \sum_{\hat{y}} e^{s(\hat{y})}$$

In vector notation:

$$Q = \mathrm{softmax}\; s$$
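A small sketch of softmax in plain Python follows. Subtracting the maximum score before exponentiating is a standard numerical safeguard that is not part of the slide; it leaves $Q$ unchanged because the common factor cancels in the ratio.

```python
import math

def softmax(s):
    """Q[y_hat] = exp(s[y_hat]) / Z, with Z the sum of exp(s[y_hat]) over labels."""
    m = max(s)  # shifting by the max avoids overflow and does not change Q
    exps = [math.exp(v - m) for v in s]
    Z = sum(exps)
    return [e / Z for e in exps]

Q = softmax([1.0, 4.0, 5.25])
print(Q, sum(Q))  # probabilities in the same order as the scores
```

Larger scores receive larger probabilities, and very large scores such as `softmax([1000.0, 1000.0])` still evaluate without overflow.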

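Log Loss and Logistic Regression

Let $Q_\Phi(\hat{y} \mid x)$ be defined by a model with parameters $\Phi$. In logistic regression $\Phi$ is the pair $(W, b)$. Let $n$ range over training instances.

$$W^*, b^* = \operatorname*{argmin}_{W, b}\; \frac{1}{N} \sum_{n=1}^{N} -\log Q_{W,b}(y_n \mid x_n)$$

$$\Phi^* = \operatorname*{argmin}_{\Phi}\; \frac{1}{N} \sum_{n=1}^{N} -\log Q_\Phi(y_n \mid x_n)$$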
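As a rough sketch, the average log loss of a linear softmax model can be computed as below. The names `neg_log_prob` and `log_loss`, the toy data, and the hand-picked weights `W_fit` are all assumptions made for illustration; no optimization is performed here.

```python
import math

def neg_log_prob(W, b, x, y):
    """-log Q_{W,b}(y | x) for the softmax of the scores s = W x + b."""
    s = [sum(w * xi for w, xi in zip(row, x)) + by for row, by in zip(W, b)]
    m = max(s)
    log_Z = m + math.log(sum(math.exp(v - m) for v in s))  # stable log Z
    return log_Z - s[y]

def log_loss(W, b, data):
    """Average of -log Q_{W,b}(y_n | x_n) over the (x_n, y_n) pairs in data."""
    return sum(neg_log_prob(W, b, x, y) for x, y in data) / len(data)

# Toy data (assumed): 2 features, 3 labels.
data = [([1.0, 0.0], 0), ([0.0, 1.0], 1), ([1.0, 1.0], 2)]
b_zero = [0.0, 0.0, 0.0]

W_zero = [[0.0, 0.0] for _ in range(3)]
uniform_loss = log_loss(W_zero, b_zero, data)   # uniform predictions: log(3)

W_fit = [[2.0, -2.0], [-2.0, 2.0], [1.0, 1.0]]  # hand-picked, favors true labels
fit_loss = log_loss(W_fit, b_zero, data)
print(uniform_loss, fit_loss)
```

Parameters that put more probability on the observed labels give a lower loss, which is what the argmin over $\Phi$ is searching for.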

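Information Theoretic Formulation

Let $\Phi$ be the parameters of a probabilistic predictor $Q_\Phi$. We want

$$\Phi^* = \operatorname*{argmin}_{\Phi}\; E_{(x,y) \sim P}\; -\log Q_\Phi(y \mid x)$$

This is cross-entropy loss:

$$H(P, Q) = E_{y \sim P}\; -\log Q(y)$$

$$H(P) = H(P, P) = E_{y \sim P}\; -\log P(y)$$

$$H(P, Q) \geq H(P)$$

$$E_{(x,y) \sim P}\; -\log Q_\Phi(y \mid x) = E_{x \sim P}\; H(P(\cdot \mid x),\, Q_\Phi(\cdot \mid x))$$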
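The inequality $H(P, Q) \geq H(P)$ can be checked numerically on a small example. The two distributions below are arbitrary values chosen for illustration, and the function names are assumptions of this sketch.

```python
import math

def cross_entropy(P, Q):
    """H(P, Q) = E_{y ~ P} -log Q(y) for distributions given as lists."""
    return sum(-p * math.log(q) for p, q in zip(P, Q) if p > 0)

def entropy(P):
    """H(P) = H(P, P)."""
    return cross_entropy(P, P)

P = [0.7, 0.2, 0.1]   # "true" distribution (assumed values)
Q = [0.5, 0.3, 0.2]   # model distribution (assumed values)

print(entropy(P), cross_entropy(P, Q))
```

The gap `cross_entropy(P, Q) - entropy(P)` is the KL divergence from $P$ to $Q$, which is zero only when the two distributions agree.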

Multi Layer Perceptrons (MLPs)

Activation functions:

$$\sigma(u) = \frac{1}{1 + e^{-u}} \qquad \mathrm{Relu}(u) = \max(u, 0)$$

$$L_0 = \mathrm{Relu}(W_0 x + b_0)$$

$$L_1 = \sigma(W_1 L_0 + b_1)$$

$$Q_\Phi = \mathrm{softmax}(L_1)$$

Explicit Index Notation with Typed Index Variables

$i$: pixels
$j$: image features
$\hat{y}$: possible image labels

$$L_0[j] = \mathrm{Relu}\Big(\sum_i W_0[j, i]\, x[i] + b_0[j]\Big)$$

$$L_1[\hat{y}] = \sigma\Big(\sum_j W_1[\hat{y}, j]\, L_0[j] + b_1[\hat{y}]\Big)$$

$$Q_\Phi(\hat{y}) = \frac{1}{Z}\, e^{L_1[\hat{y}]}$$
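The two-layer forward pass written with explicit indices can be sketched as follows. The helper names (`affine`, `mlp_forward`) and the toy sizes (3 pixels, 2 features, 3 labels) are assumptions made for this illustration.

```python
import math

def relu(u):
    return max(u, 0.0)

def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))

def affine(W, b, v):
    """out[k] = sum_i W[k][i] * v[i] + b[k]."""
    return [sum(w * vi for w, vi in zip(row, v)) + bk for row, bk in zip(W, b)]

def softmax(s):
    m = max(s)
    exps = [math.exp(v - m) for v in s]
    Z = sum(exps)
    return [e / Z for e in exps]

def mlp_forward(W0, b0, W1, b1, x):
    L0 = [relu(u) for u in affine(W0, b0, x)]      # indexed by feature j
    L1 = [sigmoid(u) for u in affine(W1, b1, L0)]  # indexed by label y_hat
    return softmax(L1)

# Toy sizes (assumed): 3 pixels, 2 features, 3 labels.
W0 = [[1.0, -1.0, 0.5], [0.0, 2.0, -1.0]]
b0 = [0.0, 0.1]
W1 = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
b1 = [0.0, 0.0, 0.5]
Q = mlp_forward(W0, b0, W1, b1, [0.2, 0.4, 0.6])
print(Q)
```

Because the sigmoid keeps every entry of $L_1$ in $(0, 1)$, the resulting softmax probabilities in this architecture can differ by at most a factor of $e$.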

Loss vs. Error Rate

While training (gradient descent) is generally done on log loss, performance is often judged by other measures such as error rate.

The term "loss" is often used as a synonym for log loss (or whatever loss defined the gradient descent training). Hence one often reports both loss and error rate.

Note that error rate is not differentiable.
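To make the distinction concrete, the sketch below computes both measures from predicted distributions. The predictions and labels are made-up values, and the function names are assumptions of this example.

```python
import math

def mean_log_loss(Qs, ys):
    """Average of -log Q_n(y_n) over the examples."""
    return sum(-math.log(Q[y]) for Q, y in zip(Qs, ys)) / len(ys)

def error_rate(Qs, ys):
    """Fraction of examples whose most probable label differs from y_n."""
    def predict(Q):
        return max(range(len(Q)), key=Q.__getitem__)
    return sum(1 for Q, y in zip(Qs, ys) if predict(Q) != y) / len(ys)

ys = [0, 0, 2]
Qs_soft = [[0.6, 0.3, 0.1], [0.4, 0.5, 0.1], [0.2, 0.2, 0.6]]
Qs_sharp = [[0.9, 0.05, 0.05], [0.45, 0.5, 0.05], [0.1, 0.1, 0.8]]

print(mean_log_loss(Qs_soft, ys), error_rate(Qs_soft, ys))
print(mean_log_loss(Qs_sharp, ys), error_rate(Qs_sharp, ys))
```

Both sets of predictions make the same single mistake, so the error rate is identical, while the log loss responds to the change in confidence. This piecewise-constant behavior is why error rate gives no gradient signal.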

Train Data, Development Data and Test Data

Data is typically divided into a training set, a development set and a test set, each drawn IID from the population.

A learning algorithm optimizes training loss.

One then optimizes algorithm design (and hyper-parameters) on the development set ("graduate student descent").

Final performance should be measured on a test set not used for development.

Test data is often withheld from developers.
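One simple way to produce such a three-way split is to shuffle once and slice. This is only a sketch: the function name `split_data`, the set sizes, and the fixed seed are assumptions, and real benchmarks often ship with a predefined split instead.

```python
import random

def split_data(pairs, n_dev, n_test, seed=0):
    """Shuffle a copy of pairs and cut it into train, dev and test lists."""
    rng = random.Random(seed)       # fixed seed so the split is reproducible
    shuffled = list(pairs)
    rng.shuffle(shuffled)
    test = shuffled[:n_test]
    dev = shuffled[n_test:n_test + n_dev]
    train = shuffled[n_test + n_dev:]
    return train, dev, test

# Placeholder data (assumed): 100 (x, y) pairs with y in 0..9.
pairs = [(i, i % 10) for i in range(100)]
train, dev, test = split_data(pairs, n_dev=10, n_test=10)
print(len(train), len(dev), len(test))  # 80 10 10
```

The three lists are disjoint and together cover the original data, so tuning on `dev` never touches `test`.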

Gradients with Respect to Systems of Parameters

$\nabla_\Phi\, \ell(\Phi, x, y)$ denotes the partial derivative of $\ell(\Phi, x, y)$ with respect to the parameter system $\Phi$.

Here we can think of $\Phi$ as a single vector with

$$(\nabla_\Phi\, \ell(\Phi, x, y))_i = \partial \ell(\Phi, x, y) / \partial \Phi_i$$

But in general $\Phi$ can be a multi-dimensional array (an ndarray in NumPy). If $\Phi$ is four dimensional we can write $\Phi[i, j, k, l]$.

For scalar loss, $\nabla_\Phi\, \ell(\Phi, x, y)$ has the same shape as $\Phi$.

$$(\nabla_\Phi\, \ell(\Phi, x, y)).\mathrm{shape} = \Phi.\mathrm{shape}$$
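The shape statement can be illustrated with a finite-difference gradient. In this sketch a nested Python list stands in for a two-dimensional ndarray, and the quadratic loss is an assumption chosen because its exact gradient, $2\Phi$, is easy to compare against.

```python
def numerical_gradient(loss, Phi, eps=1e-6):
    """Central-difference estimate of d loss / d Phi[i][j], shaped like Phi."""
    grad = [[0.0 for _ in row] for row in Phi]
    for i, row in enumerate(Phi):
        for j in range(len(row)):
            old = Phi[i][j]
            Phi[i][j] = old + eps
            up = loss(Phi)
            Phi[i][j] = old - eps
            down = loss(Phi)
            Phi[i][j] = old                      # restore the parameter
            grad[i][j] = (up - down) / (2 * eps)
    return grad

def loss(Phi):
    """Toy scalar loss (assumed): sum of squares, whose gradient is 2 * Phi."""
    return sum(v * v for row in Phi for v in row)

Phi = [[1.0, -2.0, 0.5], [0.0, 3.0, -1.0]]
g = numerical_gradient(loss, Phi)
print(g)
```

The result has one entry per parameter, arranged exactly like `Phi`, which is what makes the in-place update $\Phi \mathrel{-}= \eta \nabla_\Phi \ell$ well defined.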

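Total Gradient Descent

$$\ell_{\mathrm{train}}(\Phi) = \frac{1}{N} \sum_n \ell(\Phi, x_n, y_n)$$

We want:

$$\Phi^* = \operatorname*{argmin}_{\Phi}\; \ell_{\mathrm{train}}(\Phi)$$

Update:

$$\Phi \mathrel{-}= \eta\, \nabla_\Phi\, \ell_{\mathrm{train}}(\Phi)$$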
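A minimal sketch of the update rule on a one-parameter problem is given below. The squared-error loss $\ell(\Phi, x, y) = \tfrac{1}{2}(\Phi x - y)^2$, the data, and the learning rate are assumptions chosen so the minimizer ($\Phi = 2$) is known in advance; the lecture itself trains on log loss.

```python
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]          # generated by y = 2 x, so the minimizer is 2
N = len(xs)

def grad_train(phi):
    """Gradient of (1/N) * sum_n 0.5 * (phi * x_n - y_n)**2 with respect to phi."""
    return sum((phi * x - y) * x for x, y in zip(xs, ys)) / N

phi = 0.0
eta = 0.1                     # learning rate (assumed value)
for _ in range(200):
    phi -= eta * grad_train(phi)   # Phi -= eta * grad of the training loss

print(phi)  # close to 2.0
```

Every step uses all $N$ training pairs, which is what distinguishes total gradient descent from the stochastic variant.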

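Stochastic Gradient Descent (SGD)

SGD on the training set:

repeat: Select $n$ at random. $\Phi \mathrel{-}= \eta\, \nabla_\Phi\, \ell(\Phi, x_n, y_n)$

With $n$ drawn uniformly, so that $P(n) = 1/N$, the expected update direction is the total gradient:

$$E_n\, \nabla_\Phi\, \ell(\Phi, x_n, y_n) = \sum_n P(n)\, \nabla_\Phi\, \ell(\Phi, x_n, y_n)$$

$$= \frac{1}{N} \sum_n \nabla_\Phi\, \ell(\Phi, x_n, y_n)$$

$$= \nabla_\Phi\, \frac{1}{N} \sum_n \ell(\Phi, x_n, y_n)$$

$$= \nabla_\Phi\, \ell_{\mathrm{train}}(\Phi)$$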
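The identity between the expected per-example gradient and the total gradient can be checked numerically, followed by a short SGD loop. As before, the squared-error loss, the data, the learning rate, and the seed are assumptions made for this sketch.

```python
import random

xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]          # generated by y = 2 x
N = len(xs)

def grad_example(phi, n):
    """d/dphi of the per-example loss 0.5 * (phi * x_n - y_n)**2."""
    return (phi * xs[n] - ys[n]) * xs[n]

def grad_train(phi):
    """d/dphi of the training loss, from its expanded closed form."""
    mean_xx = sum(x * x for x in xs) / N
    mean_xy = sum(x * y for x, y in zip(xs, ys)) / N
    return phi * mean_xx - mean_xy

# E_n grad = sum_n P(n) * grad_n with P(n) = 1/N.
phi0 = 0.5
expected_grad = sum((1.0 / N) * grad_example(phi0, n) for n in range(N))
print(expected_grad, grad_train(phi0))

# SGD: select n at random, then step along that single example's gradient.
rng = random.Random(0)
phi, eta = 0.0, 0.05
for _ in range(2000):
    n = rng.randrange(N)
    phi -= eta * grad_example(phi, n)
print(phi)  # close to 2.0
```

In this noiseless toy problem every per-example gradient vanishes at $\Phi = 2$, so SGD with a constant learning rate settles there; with noisy data the iterates would keep fluctuating around the minimizer.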

Epochs

In practice we cycle through the training data, visiting each training pair once.

One pass through the training data is called an epoch.

One typically imposes a random shuffle of the training data before each epoch.
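The epoch structure can be sketched as an index permutation that is redrawn before every pass. The function name `epoch_order` and the tiny sizes are assumptions; the parameter update itself is left as a comment.

```python
import random

def epoch_order(num_examples, rng):
    """Return a freshly shuffled visiting order covering every index once."""
    order = list(range(num_examples))
    rng.shuffle(order)
    return order

rng = random.Random(0)
num_examples, num_epochs = 5, 3
visits = []
for epoch in range(num_epochs):
    order = epoch_order(num_examples, rng)
    visits.append(order)
    for n in order:
        pass  # an SGD step on training pair n would go here

print(visits)
```

Unlike drawing $n$ independently at each step, this scheme guarantees that every training pair is used exactly once per epoch.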

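SGD for MLPs

$j$: feature
$l$: layer
$\hat{y}$: possible label

$$L_0[j] = \mathrm{Input}$$

$$L_{l+1}[j'] = \sigma\Big(\sum_j W_{l+1}[j', j]\, L_l[j] + b_{l+1}[j']\Big)$$

$$L_N[\hat{y}] = \sigma\Big(\sum_j W_N[\hat{y}, j]\, L_{N-1}[j] + b_N[\hat{y}]\Big)$$

$$P(\hat{y}) = \frac{1}{Z}\, e^{L_N[\hat{y}]}$$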

END
