HOMEWORK #4: LOGISTIC REGRESSION


Probabilistic Learning: Theory and Algorithms
CS 274A, Winter 2018

Due: Friday, February 23rd, 2018, 11:55 PM
Submit code and report via EEE Dropbox. You should submit a zip file containing both your report and the code files.

INTRODUCTION

Model Definition. Logistic regression is a simple yet powerful and widely used binary classifier. Given a $(1 \times d)$-dimensional feature vector $x$, a $d$-dimensional vector of real-valued parameters $\beta$, a real-valued bias parameter $\beta_0$, and an output variable $\theta \in (0, 1)$, the logistic regression classifier is written as

$$\hat{\theta} = f(x\beta + \beta_0) \quad \text{where} \quad f(z) = \frac{1}{1 + e^{-z}}.$$

$f(\cdot)$ is known as the sigmoid or logistic function, which is where logistic regression gets its name. For simplicity, from here forward, we will absorb the bias parameter into the vector $\beta$ by appending a 1 to all feature vectors, making $x$ and $\beta$ $(d + 1)$-dimensional, which allows us to write $\hat{\theta} = f(x\beta)$.

Learning. To train a logistic regression model, we of course need labels: an observed class assignment $y \in \{0, 1\}$ for a given feature vector $x$. Although logistic regression, at first glance, may seem different from the simple models we have been working with so far (e.g. Gaussians, Betas, Gammas, Multinomials, etc.), we still learn the parameters $\beta$ via the maximum likelihood framework. Note, however, that here we are working with what's called a conditional model of the form $p(y \mid x, \theta)$, which assumes access to another data source (the features) to be fully specified.

The training objective is derived as follows; we start with just one observation $\{x, y\}$ for notational simplicity. Because the variable we want to model, the class label $y$, is binary, we use the natural distribution for binary values: the Bernoulli model. And since $\hat{\theta}$ is between zero and one, we treat it as the Bernoulli's success probability parameter; we defined the logistic regression model above with this goal in mind. Then, just as we usually do in the maximum likelihood framework, we start by writing down the log probability:

$$\begin{aligned}
\log p(y \mid x, \beta) &= \log\left[\hat{\theta}^{y} (1 - \hat{\theta})^{1 - y}\right] \quad \text{(inside the log is the Bernoulli p.m.f.)} \\
&= y \log \hat{\theta} + (1 - y) \log(1 - \hat{\theta}) \\
&= y \log f(x\beta) + (1 - y) \log(1 - f(x\beta)).
\end{aligned} \tag{1}$$

You may have seen this equation before, up to a sign, as the cross-entropy error function. To arrive at the third line, all we did was substitute $\hat{\theta} = f(x\beta)$. The intuition behind the equations above is that we are learning a Bernoulli model for the observed class labels, but instead of learning just one fixed Bernoulli parameter, we model the success parameter as a function of the features $x$. Defining the objective for a whole dataset $D = \{D_x, D_y\} = \{(x_1, \ldots, x_i, \ldots, x_N), (y_1, \ldots, y_i, \ldots, y_N)\}$, we have

$$\log p(D_y \mid D_x, \beta) = \sum_{i=1}^{N} \log p(y_i \mid x_i, \beta) = \sum_{i=1}^{N} \left[ y_i \log f(x_i \beta) + (1 - y_i) \log(1 - f(x_i \beta)) \right]. \tag{2}$$

We need to maximize Equation 2 to obtain the maximum likelihood estimate $\hat{\beta}_{MLE}$. This is where the process changes from what we did previously. If we try to set $\nabla_\beta \, p(D_y \mid D_x, \beta) = 0$ and solve for $\beta$, we will quickly find ourselves stuck, unable to find a closed-form solution. This is because the logistic function makes the gradient a nonlinear function of $\beta$ (the objective itself is concave in $\beta$, so there are no spurious local maxima). Thus we must optimize via an iterative gradient ascent [1] method. In other words, we need to iterate the following equation to gradually send $\nabla_\beta \, p(D_y \mid D_x, \beta) \to 0$, which will indicate we're at a maximum:

$$\hat{\beta}_{t+1} = \hat{\beta}_t + \alpha \nabla_\beta \log p(D_y \mid D_x, \hat{\beta}_t) \tag{3}$$

where $\hat{\beta}_t$ is the current estimate of the parameters and $\alpha$ is known as the learning rate, the size of the steps we take when ascending the function. $\alpha$ can be fixed, but usually we have a gradually decreasing learning rate (i.e. $\alpha_t \to 0$ as $t$ increases).

[1] Gradient descent and gradient ascent are essentially the same procedure. Turning one into the other is just a matter of placing a negative sign in front of the objective and moving in the opposite direction (adding or subtracting the gradient term in the update).

Using Stochastic Gradients. In Part 2 of Problem #2, we will vary what is called the training batch size. All this means is that we will approximate the gradient of the full dataset using only a subset of the data:

$$\nabla_\beta \log p(D_y \mid D_x, \beta) \approx \sum_{k=1}^{K} \nabla_\beta \log p(y_k \mid x_k, \beta) \quad \text{such that } K \ll N, \tag{4}$$

where $K$ is the batch size parameter. Training using this approximation is known as stochastic gradient ascent/descent, as we are using a stochastic approximation of the gradient. This method is commonly used because the gradient is usually well approximated by only a few instances, and therefore we can make parameter updates much faster than if we were computing the gradient using all $N$ data points. One important implementation detail to note when using and comparing various batch sizes is to take the average when computing Equation 2, not a straight sum. This is necessary because averaging standardizes the learning rate across various batch sizes. This doesn't change the optimization objective since $1/K$ is a constant that can be absorbed into the learning rate.

Prediction and Evaluation. After the training procedure has converged (i.e. the derivatives are close to zero), we can use our trained logistic regression model on a never-before-seen (test) dataset of features $D_x^{test}$ as follows. For a feature vector $x_j \in D_x^{test}$, we predict its label according to

$$\hat{y}_j = \begin{cases} 1, & \text{if } \hat{\theta}_j \geq 0.5 \\ 0, & \text{if } \hat{\theta}_j < 0.5 \end{cases} \quad \text{where } \hat{\theta}_j = f(x_j \hat{\beta}_T). \tag{5}$$

The subscript $T$ on the parameters denotes that these are the estimates as found on the last update (i.e. $t \in [0, T - 1]$). Notice that we are generating the prediction by taking the mode of the predicted Bernoulli distribution, or in other words, rounding $\hat{\theta}_j$.

If we do have labels for the test dataset (call them $D_y^{test}$), we can use them to evaluate our classifier via the following two metrics. The first one is test-set log likelihood, and it is calculated using the same function we used for training:

$$\log p(D_y^{test} \mid D_x^{test}, \hat{\beta}_T) = \log \prod_{j=1}^{M} p(y_j^{test} \mid x_j^{test}, \hat{\beta}_T) = \sum_{j=1}^{M} \left[ y_j^{test} \log \hat{\theta}_j + (1 - y_j^{test}) \log(1 - \hat{\theta}_j) \right], \tag{6}$$

where, as above, $\hat{\theta}_j = f(x_j^{test} \hat{\beta}_T)$. The second metric calculates classification performance directly. Define the classification accuracy metric as follows:

$$A = \frac{1}{M} \sum_{j=1}^{M} I[y_j^{test} = \hat{y}_j] \tag{7}$$

where $I$ is the indicator function and takes value 1 if the true test label matches the predicted label and 0 if there is a mismatch. $A$, then, simply counts the number of correct predictions and divides by the size of the test set, $M$.
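As a small worked illustration of Equations 5 to 7, the sketch below evaluates a hand-picked parameter vector on four made-up test points. The data, the parameter values, and the helper names (`logistic`, `predict_proba`) are assumptions chosen for illustration and are not part of the assignment code. It reports the average log likelihood rather than the sum, following the averaging advice above.

```python
import math

def logistic(z):
    """Logistic (sigmoid) function f(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(x, beta):
    """theta_hat = f(x . beta) for one feature vector (bias already appended)."""
    return logistic(sum(xi * bi for xi, bi in zip(x, beta)))

# Toy data (hypothetical): one real feature plus a trailing 1 for the bias.
test_x = [[-2.0, 1.0], [-0.5, 1.0], [0.5, 1.0], [2.0, 1.0]]
test_y = [0, 1, 0, 1]
beta_T = [1.0, 0.0]  # hand-picked "trained" parameters, not a real fit

theta_hats = [predict_proba(x, beta_T) for x in test_x]

# Equation 5: round theta_hat to get the predicted label.
y_hats = [1 if th >= 0.5 else 0 for th in theta_hats]

# Equation 6, divided by M so that the value is an average.
avg_log_lik = sum(
    y * math.log(th) + (1 - y) * math.log(1.0 - th)
    for y, th in zip(test_y, theta_hats)
) / len(test_y)

# Equation 7: fraction of test points whose prediction matches the label.
accuracy = sum(int(y == yh) for y, yh in zip(test_y, y_hats)) / len(test_y)

print(y_hats, round(avg_log_lik, 4), accuracy)
```

With these toy numbers the two middle points fall on the wrong side of the 0.5 threshold, so half of the predictions are correct. The log likelihood also reflects how confident each prediction was, which is why the two metrics are reported separately.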

Figure 1: Example plot for Problem 2.1. The plot shows the log likelihood during the training of three models, the log likelihood on the validation data for each model, and the log likelihood on the test set for the best model.

Dataset. In this homework, we will be implementing a logistic regression classifier and therefore need some dataset to train it on. For this we will use a very popular dataset of handwritten digit images called MNIST. You will find this dataset in many machine learning publications; it is mostly used for illustrative purposes or as a sanity check since it's easy to train a high accuracy model for it. The full version of MNIST has ten classes, but we've reduced the dataset down to only 2's and 6's to formulate a binary classification problem. We have already segmented the data into training, validation, and testing splits of 9467 instances, 2367 instances, and 1915 instances respectively. Each image is represented as a vector with 784 dimensions, each dimension corresponding to pixel intensity between zero and one, and therefore the training features, for instance, comprise a matrix of size $9467 \times 784$. The data can be downloaded in CSV format from the EEE dropbox for the assignment under the CourseFiles folder.

PROBLEM #1: GRADIENT DERIVATION

Given: We start by filling in the details of Equation 2. Be sure to be explicit and precise in your solutions so even if someone did not have the prompt, they could follow your work. You can start from the last line of Equation 2. You can also assume the derivative of the logistic function is given: $\frac{\partial}{\partial z} f(z) = f(z)(1 - f(z))$.

Part 1 (15 points): Derive an expression for $\nabla_\beta \log p(D_y \mid D_x, \hat{\beta}_t)$.

Part 2 (5 points): State the intuition behind the expression, i.e. why should sending the gradient to zero make the classifier work?

PROBLEM #2: IMPLEMENTATION

Given: Now we can implement a logistic regression classifier.
Using the programming language of your choice, write a script that trains a logistic regression model and evaluates the model by calculating log likelihood (Equation 6) and accuracy (Equation 7). Some skeleton Python code is provided in the Appendix (and in EEE dropbox). You can choose not to use this code at all if you wish. The only requirement is that no logistic regression or auto-differentiation libraries can be used, i.e. a library that implements/trains a logistic regression model as a function call or that calculates gradients automatically.

For all parts, take the number of epochs, i.e. passes through the training data, to be 250. The total number of updates will then be $T = 250 \times N / \text{batch\_size}$. Each subproblem below gives three parameter settings for learning rate (Part 1) or batch size (Part 2). Train three models, one for each setting, and once training is complete, calculate the accuracy and log likelihood on the validation set. Then, for the model that had the highest validation accuracy, calculate the test-set accuracy and log likelihood. Report the log likelihood numbers in a plot similar to Figure 1; report the accuracies numerically. Include all code for your implementation.

Part 1 (20 points): The first task is to train three logistic regression models, one for each of the following learning rates ($\alpha$ in Equation 3): 0.001, 0.01, 0.1. Report the results as described above. In a few sentences, comment on the results.

Part 2 (20 points): The second task is to train three more logistic regression models, one for each of the following batch sizes ($K$ in Equation 4): 1000, 100, 10. Report the results as described above. In a few sentences, comment on the results.

PROBLEM #3: ADAPTIVE LEARNING RATES

Given: In Problem #2, we fixed the learning rate across all updates. Here we will look at how to intelligently change the learning rate as training progresses. We will experimentally compare the following three ways to change the step size.

1. Robbins-Monro Schedule: One method is to use a deterministic schedule to decay the learning rate as training progresses. The Robbins-Monro schedule, suggested because of some theoretical properties we won't discuss here, is written as

$$\alpha_t = \alpha_0 / t \quad \text{for } t \in [1, T - 1]. \tag{8}$$

In other words, we set the learning rate at time $t$ by dividing some fixed, initial learning rate $\alpha_0$ by the current time counter.

2. AdaM: A second method is called AdaM, for (Ada)ptive (M)oments. The method keeps geometrically weighted running estimates of the mean and variance of the gradients observed so far and updates the parameters according to

$$\hat{\beta}_{t+1} = \hat{\beta}_t + \alpha_0 \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}, \tag{9}$$

where $\alpha_0$ is some fixed constant, $\hat{m}_t$ is the geometric mean of the past gradients, $\hat{v}_t$ the geometric variance, and $\epsilon$ is some small value to keep the denominator from being zero. Complete Python code for performing AdaM updates is provided in the Appendix.

3. Newton-Raphson: The last method we will consider is Newton-Raphson, also called Newton's Method. This is something of a gold standard for adaptively setting the learning rate, as there are no parameters to choose ourselves. We simply set the learning rate to the inverse of the (negated) Hessian matrix, which in the case of logistic regression is:

$$\alpha_t = (X^T A_t X)^{-1} \tag{10}$$

where $A_t$ is a diagonal matrix with $a^t_{i,i} = \hat{\theta}_i (1 - \hat{\theta}_i)$ (i.e. computed from the predicted Bernoulli probability using the parameter estimates at time $t$) and $X$ is the $N \times d$ matrix of features ($K \times d$ for batch size $K$ when using stochastic gradient methods). The problem with Newton's method is that computing the matrix inverse for each update is very costly. It's around $O(d^3)$, which for MNIST is on the order of $784^3 = 481{,}890{,}304$ operations.

Part 1 (30 points): Implement Robbins-Monro, AdaM, and Newton-Raphson updates for the logistic regression model you implemented in Problem #2. Python code for AdaM is provided in the Appendix. Set $\alpha_0$ to 0.1 for Robbins-Monro and set $\alpha_0$ to 0.001 for AdaM. Use a batch size [2] of 200 and 50 epochs (passes through the full training data) for each method. Report the same plot and accuracies you did for Problem #2: one training curve and validation log likelihood for each method, test log likelihood for the best method according to validation performance. Discuss what you observe in a few sentences. Include all code for your implementation.

[2] Newton's method will take a few minutes to train for 50 epochs. Use a much bigger batch size to first make sure your Hessian calculation is correct.

Part 2 (10 points): Answer the following three questions.

1. What's the intuition behind Newton's Method? In other words, what does the Hessian tell us about a given location on the function?

2. What's the intuition behind the AdaM update? Think about what happens to the step size when the variance of the gradients ($\hat{v}_t$) is large and when it is small. Is the underlying mechanism similar to Newton's method's use of the Hessian?

3. State the runtime and memory complexity (in big O notation) in terms of the feature dimensionality $d$ and the batch size $K$ for each update method.

APPENDIX

STARTER IMPLEMENTATION

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    import copy

    ### Helper Functions ###

    def logistic_fn(z):
        ### TO DO ###
        raise NotImplementedError

    def load_data_pairs(type_str):
        # returns (features, labels) as NumPy arrays for "train", "valid", or "test"
        return (pd.read_csv("mnist_2s_and_6s/" + type_str + "_x.csv").values,
                pd.read_csv("mnist_2s_and_6s/" + type_str + "_y.csv").values)

    def run_log_reg_model(x, beta):
        ### TO DO ###
        raise NotImplementedError

    def calc_log_likelihood(x, y, beta):
        theta_hats = run_log_reg_model(x, beta)
        ### Return an average, not a sum!
        ### TO DO ###
        raise NotImplementedError

    def calc_accuracy(x, y, beta):
        theta_hats = run_log_reg_model(x, beta)
        ### TO DO ###
        raise NotImplementedError


    ### Model Training ###

    def train_logistic_regression_model(x, y, beta, learning_rate, batch_size, max_epoch):
        beta = copy.deepcopy(beta)
        n_batches = x.shape[0] // batch_size  # integer number of batches per epoch
        train_progress = []

        for epoch_idx in range(max_epoch):
            for batch_idx in range(n_batches):

                ### TO DO ###

                # perform updates
                beta += learning_rate * beta_grad

            train_progress.append(calc_log_likelihood(x, y, beta))
            print("Epoch %d.  Train Log Likelihood: %f" % (epoch_idx, train_progress[-1]))

        return beta, train_progress


    if __name__ == "__main__":

        ### Load the data
        train_x, train_y = load_data_pairs("train")
        valid_x, valid_y = load_data_pairs("valid")
        test_x, test_y = load_data_pairs("test")

        # add a one for the bias term
        train_x = np.hstack([train_x, np.ones((train_x.shape[0], 1))])
        valid_x = np.hstack([valid_x, np.ones((valid_x.shape[0], 1))])
        test_x = np.hstack([test_x, np.ones((test_x.shape[0], 1))])

        ### Initialize model parameters
        beta = np.random.normal(scale=.001, size=(train_x.shape[1], 1))

        ### Set training parameters
        learning_rates = [1e-3, 1e-2, 1e-1]
        batch_sizes = [train_x.shape[0]]
        max_epochs = 250

        ### Iterate over training parameters, testing all combinations
        valid_ll = []
        valid_acc = []
        all_params = []
        all_train_logs = []

        for lr in learning_rates:
            for bs in batch_sizes:
                ### train model
                final_params, train_progress = train_logistic_regression_model(
                    train_x, train_y, beta, lr, bs, max_epochs)
                all_params.append(final_params)
                all_train_logs.append(
                    (train_progress, "Learning rate: %f, Batch size: %d" % (lr, bs)))

                ### evaluate model on validation data
                valid_ll.append(calc_log_likelihood(valid_x, valid_y, final_params))
                valid_acc.append(calc_accuracy(valid_x, valid_y, final_params))

        ### Get best model
        best_model_idx = np.argmax(valid_acc)
        best_params = all_params[best_model_idx]
        test_ll = calc_log_likelihood(test_x, test_y, best_params)
        test_acc = calc_accuracy(test_x, test_y, best_params)
        print("Validation Accuracies: " + str(valid_acc))
        print("Test Accuracy: %f" % (test_acc))

        ### Plot
        plt.figure()
        epochs = range(max_epochs)
        for idx, log in enumerate(all_train_logs):
            # training curve, then a flat line at the final validation log likelihood
            plt.plot(epochs, log[0], linewidth=3, label="Training, " + log[1])
            plt.plot(epochs, max_epochs * [valid_ll[idx]], linewidth=5, label="Validation, " + log[1])
        plt.plot(epochs, max_epochs * [test_ll], ms=8, label="Testing, " + all_train_logs[best_model_idx][1])

        plt.xlabel(r"Epoch ($t$)")
        plt.ylabel("Log Likelihood")
        plt.ylim([-.8, 0.])
        plt.title("MNIST Results for Various Logistic Regression Models")
        plt.legend(loc=4)
        plt.show()

ADAM UPDATE

    def get_adam_update(alpha_0, grad, adam_values, b1=.95, b2=.999, e=1e-8):
        adam_values['t'] += 1

        # update mean
        adam_values['mean'] = b1 * adam_values['mean'] + (1 - b1) * grad
        m_hat = adam_values['mean'] / (1 - b1 ** adam_values['t'])

        # update variance
        adam_values['var'] = b2 * adam_values['var'] + (1 - b2) * grad ** 2
        v_hat = adam_values['var'] / (1 - b2 ** adam_values['t'])

        return alpha_0 * m_hat / (np.sqrt(v_hat) + e)

    # Initialize a dictionary that keeps track of the mean, variance, and update counter
    alpha_0 = 1e-3
    adam_values = \
        {'mean': np.zeros(beta.shape), 'var': np.zeros(beta.shape), 't': 0}

    ### Inside the training loop do ###
    beta_grad = ...  # compute gradient w.r.t. the weight vector (beta) as usual
    beta_update = get_adam_update(alpha_0, beta_grad, adam_values)
    beta += beta_update
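To get a feel for how the Robbins-Monro schedule (Equation 8) and the AdaM update (Equation 9) behave before running them on MNIST, the sketch below applies both to a one-dimensional toy problem. Everything here is an assumption made for illustration: the concave objective $g(\beta) = -(\beta - 3)^2$, the helper names (`toy_grad`, `robbins_monro_ascent`, `adam_ascent`), the step counts, and the value $\alpha_0 = 0.01$ used for AdaM. The AdaM code is a scalar, pure-Python restatement of the update in the Appendix, not a replacement for it.

```python
import math

def toy_grad(beta):
    """Gradient of the toy concave objective g(beta) = -(beta - 3)^2."""
    return -2.0 * (beta - 3.0)

def robbins_monro_ascent(beta, alpha_0, n_steps):
    """Gradient ascent with the schedule alpha_t = alpha_0 / t (Equation 8)."""
    for t in range(1, n_steps + 1):
        beta += (alpha_0 / t) * toy_grad(beta)
    return beta

def adam_ascent(beta, alpha_0, n_steps, b1=0.95, b2=0.999, eps=1e-8):
    """Scalar version of the AdaM update (Equation 9) with the Appendix defaults."""
    mean, var = 0.0, 0.0
    for t in range(1, n_steps + 1):
        grad = toy_grad(beta)
        mean = b1 * mean + (1 - b1) * grad       # running mean of gradients
        var = b2 * var + (1 - b2) * grad ** 2    # running mean of squared gradients
        m_hat = mean / (1 - b1 ** t)             # bias-corrected mean
        v_hat = var / (1 - b2 ** t)              # bias-corrected variance
        beta += alpha_0 * m_hat / (math.sqrt(v_hat) + eps)
    return beta

# Both runs start from beta = 0; the maximizer of the toy objective is beta = 3.
beta_rm = robbins_monro_ascent(beta=0.0, alpha_0=0.1, n_steps=2000)
beta_adam = adam_ascent(beta=0.0, alpha_0=0.01, n_steps=2000)
print("Robbins-Monro:", beta_rm, "AdaM:", beta_adam)
```

On this toy problem the $1/t$ schedule shrinks the step size quickly, so the Robbins-Monro iterate is still noticeably short of 3 after 2000 steps, while the AdaM iterate ends up very close to it. This is only an illustration of the mechanics; behaviour on the MNIST problem may differ.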