HOMEWORK #4: LOGISTIC REGRESSION

Probabilistic Learning: Theory and Algorithms
CS 274A, Winter 2019
Due: 11am Monday, February 25th, 2019
Submit scan of plots/written responses to Gradebook; submit your code to HW4 EEE Dropbox

INTRODUCTION

Model Definition. Logistic regression is a simple yet powerful and widely used binary classifier. Given a $(1 \times d)$-dimensional feature vector $x$, a $d$-dimensional vector of real-valued parameters $\beta$, a real-valued bias parameter $\beta_0$, and an output variable $\theta \in (0, 1)$, the logistic regression classifier is written as

$$\hat{\theta} = f(x\beta + \beta_0) \quad \text{where} \quad f(z) = \frac{1}{1 + e^{-z}}.$$

$f(\cdot)$ is known as the sigmoid or logistic function, which is where logistic regression gets its name. For simplicity, from here forward, we will absorb the bias parameter into the vector $\beta$ by appending a 1 to all feature vectors, which allows us to write $\hat{\theta} = f(x\beta)$.

Learning. To train a logistic regression model, we of course need labels: an observed class assignment $y \in \{0, 1\}$ for a given feature vector $x$. Although logistic regression, at first glance, may seem different than the simple density and distribution models (such as Gaussians, Betas, Gammas, Multinomials, etc.), we still learn the parameters $\beta$ via the maximum likelihood framework. Yet, note, here we are working with what's called a conditional model of the form $p(y \mid x, \theta)$, which assumes access to another data source (the features).

The training objective is derived as follows; we start with just one observation $\{x, y\}$ for notational simplicity. Because the variable we want to model (the class label $y$) is binary, we use the natural distribution for binary values: the Bernoulli model. And since $\hat{\theta}$ is between zero and one, we treat it as the Bernoulli's success probability parameter; we defined the logistic regression model above with this goal in mind.
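As a minimal sketch of the model definition above (not part of the provided starter code), the following NumPy snippet evaluates $\hat{\theta} = f(x\beta)$ after absorbing the bias into $\beta$. The function names `sigmoid`, `append_bias`, and `predict_proba`, as well as the toy numbers, are assumptions made purely for illustration.

```python
import numpy as np

def sigmoid(z):
    """Logistic function f(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def append_bias(x):
    """Append a column of ones so the bias beta_0 is absorbed into beta."""
    return np.hstack([x, np.ones((x.shape[0], 1))])

def predict_proba(x, beta):
    """Return theta_hat = f(x beta), one value per row of x."""
    return sigmoid(x @ beta)

# Toy example with made-up numbers: two 2-dimensional feature vectors.
x = np.array([[0.0, 0.0],
              [1.0, 2.0]])
# Hypothetical parameters; the last entry plays the role of beta_0.
beta = np.array([[1.0], [-1.0], [0.5]])
theta_hat = predict_proba(append_bias(x), beta)
print(theta_hat)
```

Each entry of `theta_hat` lies strictly between zero and one, which is what allows it to serve as a Bernoulli success probability. For inputs of very large magnitude, a numerically safer formulation of the sigmoid may be preferable.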
Then, just as we usually do in the maximum likelihood framework, we start by writing down the log probability:

$$\begin{aligned}
\log p(y \mid x, \beta) &= \log\left[\hat{\theta}^{y} (1 - \hat{\theta})^{1 - y}\right] \quad \text{(inside the log is the Bernoulli p.m.f.)} \\
&= y \log \hat{\theta} + (1 - y) \log(1 - \hat{\theta}) \\
&= y \log f(x\beta) + (1 - y) \log(1 - f(x\beta)).
\end{aligned} \tag{1}$$

You may have seen this expression before, up to a sign, as the cross-entropy or log-loss error function. To arrive at the third line, all we did was substitute $\hat{\theta} = f(x\beta)$. The intuition behind the equations above is that we are learning a Bernoulli model for the observed class labels, but instead of learning just one fixed Bernoulli parameter, we model the success parameter as a function of the features $x$.

Defining the objective for a whole dataset $D = \{D_x, D_y\} = \{(x_1, \ldots, x_i, \ldots, x_N), (y_1, \ldots, y_i, \ldots, y_N)\}$, and assuming that the observations are conditionally independent given the parameters $\beta$, we have

$$\log p(D_y \mid D_x, \beta) = \log \prod_{i=1}^{N} p(y_i \mid x_i, \beta) = \sum_{i=1}^{N} y_i \log f(x_i\beta) + (1 - y_i) \log(1 - f(x_i\beta)). \tag{2}$$

We need to maximize Equation 2 to obtain the maximum likelihood estimate $\hat{\beta}_{MLE}$. This is where the process differs from what we typically did when estimating maximum likelihood parameters for simple densities/distributions such as Gaussians or multinomials. If we try to set $\nabla_\beta \log p(D_y \mid D_x, \beta) = 0$ and solve for $\beta$, we will quickly find ourselves stuck, unable to find a closed-form solution (we end up with a coupled set of non-linear equations in the unknowns $\beta$).

One way to solve such problems is to optimize via an iterative gradient ascent [1] method. Let $\nabla_\beta \log p(D_y \mid D_x, \beta)$ be the gradient of the log-likelihood evaluated at our current guess $\beta$ for the parameters; this is a vector of partial derivatives, one partial derivative for each component (each parameter) in the vector $\beta$. The gradient points in the steepest uphill direction (locally) of the log-likelihood, so we can take a small step in this direction to move uphill (the step needs to be small since "steepest" is only true at the current location). In other words, we need to iterate the following equation to gradually send $\nabla_\beta \log p(D_y \mid D_x, \beta) \to 0$, which will indicate we're at a local maximum:

$$\hat{\beta}_{t+1} = \hat{\beta}_t + \alpha \nabla_\beta \log p(D_y \mid D_x, \hat{\beta}_t) \tag{3}$$

where $\hat{\beta}_t$ is the current estimate of the parameters and $\alpha$ is known as the learning rate, the size of the steps we take when ascending the function. $\alpha$ can be fixed, but usually we have a gradually decreasing learning rate (i.e. $\alpha_t \to 0$ as $t$ increases).

Using Stochastic Gradients. In Part 2 of Problem #2, we will vary what is called the training batch size. All this means is that we will approximate the gradient of the full dataset using only a subset of the data:

$$\nabla_\beta \log p(D_y \mid D_x, \beta) \approx \sum_{k=1}^{K} \nabla_\beta \log p(y_k \mid x_k, \beta) \quad \text{such that } K \ll N, \tag{4}$$

where $K$ is the batch size parameter (the batch sum matches the full gradient only up to a factor of roughly $N/K$, which can be absorbed into the learning rate). Training using this approximation is known as stochastic gradient ascent/descent, as we are using a stochastic approximation of the gradient. This method is commonly used because the gradient is usually well approximated by only a few instances, and therefore we can make parameter updates much faster than if we were computing the gradient using all $N$ data points.

Prediction and Evaluation. After the training procedure has converged (i.e. the derivatives are close to zero), we can use our trained logistic regression model on a holdout (test) dataset of features $D_x^{\text{test}}$ as follows.
For a feature vector $x_j \in D_x^{\text{test}}$, we predict its label according to

$$\hat{y}_j = \begin{cases} 1, & \text{if } \hat{\theta}_j \ge 0.5 \\ 0, & \text{if } \hat{\theta}_j < 0.5 \end{cases} \quad \text{where } \hat{\theta}_j = f(x_j \hat{\beta}_T). \tag{5}$$

The subscript $T$ on the parameters denotes that these are the estimates as found on the last update (i.e. $t \in [0, T-1]$). Notice that we are generating the prediction by using a point estimate $\hat{\beta}_T$ of the parameter vector. An alternative Bayesian approach would be to average over the posterior density of $\beta$ given the data, but even for a relatively simple model like logistic regression, computing this posterior density is in general intractable (there is no conjugate prior and no closed-form solution for the posterior).

Assuming that we have labels for the test dataset (call them $D_y^{\text{test}}$), we can use them to evaluate our classifier via the following two metrics. The first one is test-set log likelihood, and it is calculated using the same function we used for training:

$$\log p(D_y^{\text{test}} \mid D_x^{\text{test}}, \hat{\beta}_T) = \log \prod_{j=1}^{M} p(y_j^{\text{test}} \mid x_j^{\text{test}}, \hat{\beta}_T) = \sum_{j=1}^{M} y_j^{\text{test}} \log \hat{\theta}_j + (1 - y_j^{\text{test}}) \log(1 - \hat{\theta}_j), \tag{6}$$

where, as above, $\hat{\theta}_j = f(x_j^{\text{test}} \hat{\beta}_T)$ and $M$ is the number of examples in our test set. The second metric calculates classification performance directly. Define the classification accuracy metric as follows:

$$A = \frac{1}{M} \sum_{j=1}^{M} \mathbb{I}[y_j^{\text{test}} = \hat{y}_j] \tag{7}$$

where $\mathbb{I}$ is the indicator function and takes value 1 if the true test label matches the predicted label and 0 if there is a mismatch. $A$, then, simply counts the number of correct predictions and divides by the size of the test set, $M$.

[1] Gradient descent and gradient ascent are essentially the same procedure. Turning one into the other is just a matter of placing a negative sign in front of the objective and moving in the opposite direction (adding or subtracting the gradient term in the update).

Dataset. In this homework, we will be implementing a logistic regression classifier and therefore need a dataset to train it on. For this we will use a very popular dataset of handwritten digit images called MNIST. You will find this dataset in many machine learning publications; it is mostly used for illustrative purposes or as a sanity check since it's easy to train a high-accuracy model for it. The full version of MNIST has ten classes, but we've reduced the dataset down to only 2's and 6's to formulate a binary classification problem. We have already segmented the data into training, validation, and testing splits of 9467 instances, 2367 instances, and 1915 instances respectively. Each image is represented as a vector with 784 dimensions, each dimension corresponding to a pixel intensity between zero and one, and therefore the training features, for instance, comprise a matrix of size 9467 × 784.

The data can be downloaded in CSV format here: www.ics.uci.edu/~smyth/courses/cs274/homeworks/hw4_materials (you can also find this link on the class Website). Each data vector in the "x files" corresponds to the input pixel data for a particular image. For example, the ith row in file trainx.csv is the ith training example $x_i$. Each such example is a vector with 784 pixel values, corresponding to a 28 by 28 pixel image of a handwritten "2" or "6". (Optional: If you wish to see what the data looks like, you can write a simple function that plots each vector as a square image; you may need to rotate and flip the image to see what the digits really look like.)
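For the optional visualization mentioned above, one possible sketch of the reshaping step is shown below. The helper name `vector_to_image` and its `rotations` and `flip` parameters are hypothetical, and which rotation or flip (if any) the course files actually need is not stated here, so those settings are left to experimentation. The input row is random stand-in data, not real MNIST pixels.

```python
import numpy as np

def vector_to_image(pixels, side=28, rotations=0, flip=False):
    """Reshape a flat pixel vector into a (side x side) array.

    rotations: number of counter-clockwise 90-degree turns to apply.
    flip: if True, mirror the image left-to-right after rotating.
    """
    image = np.asarray(pixels, dtype=float).reshape(side, side)
    image = np.rot90(image, k=rotations)
    if flip:
        image = np.fliplr(image)
    return image

# Stand-in for one row of the training features (random values for illustration).
fake_row = np.random.rand(784)
image = vector_to_image(fake_row, rotations=1, flip=True)
print(image.shape)
```

With matplotlib available, passing the returned array to `plt.imshow(image, cmap="gray")` would display it as a square image.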

Figure 1: Example plot for Problem 2.1. The plot shows the log likelihood during the training of three models, the log likelihood on the validation data for each model, and the log likelihood on the test set for the best model.

PROBLEM #1: GRADIENT DERIVATION

We start by filling in the details of gradient descent. You can use the fact that the derivative of the logistic function is known:

$$\frac{\partial}{\partial z} f(z) = f(z)(1 - f(z)).$$

(20 points): Derive the equation for the gradient for this problem: $\nabla_\beta \log p(D_y \mid D_x, \hat{\beta}_t)$.

PROBLEM #2: IMPLEMENTATION

Given: Next you will implement a logistic regression classifier. Using the programming language of your choice, write a script that (a) trains a logistic regression model and (b) evaluates a model (with some setting of the parameters) by calculating log likelihood (Equation 6) and accuracy (Equation 7). Some skeleton Python code is provided in the Appendix and on the class Website. You can choose not to use this code at all if you wish. The only requirement is that no logistic regression or auto-differentiation libraries may be used, i.e. no library that implements/trains a logistic regression model as a function call or that calculates gradients automatically.

For all parts, take the number of epochs, i.e. passes through the training data, to be 250. The total number of updates will then be T = 250 × N / batch_size. Each subproblem below gives three parameter settings for learning rate (Part 1) or batch size (Part 2). Train three models, one for each setting, and once training is complete, calculate the accuracy and log likelihood on the validation set. Then, for the model that had the highest classification accuracy on the validation set, calculate the test-set accuracy and log likelihood.

Report the log likelihood numbers in a plot similar to Figure 1, where you show the log-likelihood on the training data as a function of the number of epochs and show the log-likelihood (after epoch 250) on the validation set (as a horizontal straight line). Report the classification accuracies numerically in a table (both on validation data and, for the best validation model, on the test data). Include all code for your implementation.

Part 1 (20 points): The first task is to train three logistic regression models, one for each of the following learning rates ($\alpha$ in Equation 3): 0.001, 0.01, 0.1. Report the results as described above. In a few sentences, comment on the results.
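To make the update count T = 250 × N / batch_size from the problem statement concrete, here is a small sketch that counts updates and enumerates batch boundaries. It assumes that any incomplete final batch is dropped (floor division), which is one reading of the starter code; the function names are hypothetical, and shuffling between epochs is deliberately left out.

```python
def batches_per_epoch(n_examples, batch_size):
    """Number of full batches in one pass; any remainder is dropped."""
    return n_examples // batch_size

def batch_slices(n_examples, batch_size):
    """Yield one slice per full batch, in order."""
    for b in range(batches_per_epoch(n_examples, batch_size)):
        yield slice(b * batch_size, (b + 1) * batch_size)

def total_updates(n_examples, batch_size, n_epochs):
    """T = n_epochs * (number of full batches per epoch)."""
    return n_epochs * batches_per_epoch(n_examples, batch_size)

n_train = 9467  # training-set size stated in the Dataset paragraph
for batch_size in (n_train, 1000, 100, 10):
    print(batch_size, total_updates(n_train, batch_size, n_epochs=250))
```

Each slice can be used to index rows of the feature matrix and label vector, e.g. `x[s]` and `y[s]` for a slice `s`. Smaller batches mean many more (cheaper, noisier) updates for the same number of epochs.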

Part 2 (20 points): The second task is to train three more logistic regression models, one for each of the following batch sizes ($K$ in Equation 4): 1000, 100, 10. Report the results as described above. In a few sentences, comment on the results. You can use any learning rate here, e.g., 0.01.

PROBLEM #3: ADAPTIVE LEARNING RATES

Given: In Problem #2, we fixed the learning rate across all updates. Here we will look at how to intelligently change the learning rate as training progresses. We will experimentally compare the following three ways to change the step size.

1. Robbins-Monro Schedule: One method is to use a deterministic schedule to decay the learning rate as training progresses. The Robbins-Monro schedule, suggested because of some theoretical properties we won't discuss here, is written as

   $$\alpha_t = \alpha_0 / t \quad \text{for } t \in [1, T-1]. \tag{8}$$

   In other words, we set the learning rate at time $t$ by dividing some fixed, initial learning rate $\alpha_0$ by the current time counter.

2. AdaM: A second method is called AdaM, for (Ada)ptive (M)oments. The method keeps exponentially weighted running estimates of the mean and (uncentered) variance of the gradients observed so far and updates the parameters according to

   $$\hat{\beta}_{t+1} = \hat{\beta}_t + \alpha_0 \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}, \tag{9}$$

   where $\alpha_0$ is some fixed constant, $\hat{m}_t$ is the running mean of the past gradients, $\hat{v}_t$ is the running variance, and $\epsilon$ is some small value to keep the denominator from being zero. Complete Python code for performing AdaM updates is provided in the Appendix.

3. Newton-Raphson: The last method we will consider is Newton-Raphson, also called Newton's Method. This is something of a gold standard for adaptively setting the learning rate, as there are no parameters to choose ourselves. We simply replace the learning rate by the inverse Hessian matrix (of size $d \times d$; strictly, the Hessian of the negative log-likelihood), which in the case of logistic regression is

   $$\alpha_t = (X^T A_t X)^{-1} \tag{10}$$

   where $\alpha_t$ is a $d \times d$ matrix, $A_t$ is a diagonal matrix with $a^t_{i,i} = \hat{\theta}_i (1 - \hat{\theta}_i)$ ($\hat{\theta}_i$ being the predicted Bernoulli probability using the parameter estimates at time $t$), and $X$ is the $N \times d$ matrix of features ($K \times d$ for batch size $K$ when using stochastic gradient methods). Note that there are two issues with Newton's method that limit its usefulness in high-dimensional problems:

   (a) The first significant issue with Newton's method is that computing the matrix inverse for each update is very costly computationally. It's around $O(d^3)$, which for MNIST is on the order of $784^3 = 481{,}890{,}304$ operations. So, doing a Newton update may be significantly slower than the other methods when you run your code.

   (b) The other significant issue with Newton's method is that computing the inverse may be numerically unstable (you may get singular matrices). One way to avoid this is to add a small constant to each of the entries on the diagonal of the Hessian matrix, i.e., replace the Hessian $H$ with $H + \epsilon I$ where $I$ is the identity matrix. Your code can automatically start with a small $\epsilon$ value of around 0.001 and keep increasing it until it no longer gets a singular matrix when inverting $H$. (This is somewhat ad hoc, but should work.) Another option you could experiment with is to use the pseudoinverse of the Hessian.

Part 1 (30 points): Implement Robbins-Monro, AdaM, and Newton-Raphson updates for the logistic regression model you implemented in Problem #2. Python code for AdaM is provided in the Appendix. Set $\alpha_0$ to 0.1 for Robbins-Monro and set $\alpha_0$ to 0.001 for AdaM. Use a batch size [2] of 200 and 50 epochs (passes through the full training data) for each method. Report the same plot and accuracies you did for Problem #2. There will be one training curve and validation log likelihood for each method. For accuracy there will be a table with the classification accuracy of each method on the validation data and the accuracy of the best method on the test data. Discuss what you observe in a few sentences. Include all code for your implementation.

[2] Newton's method will take a few minutes to train for 50 epochs. Use a much bigger batch size to first make sure your Hessian calculation is correct.

Part 2 (10 points): Answer the following three questions.

1. What's the intuition behind Newton's Method? In other words, what does the Hessian tell us about a given location on the function?

2. What's the intuition behind the AdaM update? Think about what happens to the step size when the variance of the gradients ($\hat{v}_t$) is large and when it is small. Is the underlying mechanism similar to Newton's method's use of the Hessian?

3. State the runtime and memory complexity (in big O notation) in terms of the feature dimensionality $d$ and the batch size $K$ for each update method.

APPENDIX

STARTER IMPLEMENTATION

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    import copy

    ### Helper Functions ###

    def logistic_fn(z):
        ### TO DO ###

    def load_data_pairs(type_str):
        return pd.read_csv("mnist_3s_and_7s/"+type_str+"_x.csv").values, pd.read_csv("mnist_3s_and_7s/"+type_str+"_y.csv").values

    def run_log_reg_model(x, beta):
        ### TO DO ###

    def calc_log_likelihood(x, y, beta):
        theta_hats = run_log_reg_model(x, beta)
        ### Return an average, not a sum!
        ### TO DO ###

    def calc_accuracy(x, y, beta):
        theta_hats = run_log_reg_model(x, beta)
        ### TO DO ###


    ### Model Training ###

    def train_logistic_regression_model(x, y, beta, learning_rate, batch_size, max_epoch):
        beta = copy.deepcopy(beta)
        # integer number of full batches per epoch
        n_batches = x.shape[0] // batch_size
        train_progress = []

        for epoch_idx in range(max_epoch):
            for batch_idx in range(n_batches):

                ### TO DO ###

                # perform updates
                beta += learning_rate * beta_grad

            train_progress.append(calc_log_likelihood(x, y, beta))
            print("Epoch %d. Train Log Likelihood: %f" % (epoch_idx, train_progress[-1]))

        return beta, train_progress


    if __name__ == "__main__":

        ### Load the data
        train_x, train_y = load_data_pairs("train")
        valid_x, valid_y = load_data_pairs("valid")
        test_x, test_y = load_data_pairs("test")

        # add a one for the bias term
        train_x = np.hstack([train_x, np.ones((train_x.shape[0], 1))])
        valid_x = np.hstack([valid_x, np.ones((valid_x.shape[0], 1))])
        test_x = np.hstack([test_x, np.ones((test_x.shape[0], 1))])

        ### Initialize model parameters
        beta = np.random.normal(scale=.001, size=(train_x.shape[1], 1))

        ### Set training parameters
        learning_rates = [1e-3, 1e-2, 1e-1]
        batch_sizes = [train_x.shape[0]]
        max_epochs = 250

        ### Iterate over training parameters, testing all combinations
        valid_ll = []
        valid_acc = []
        all_params = []
        all_train_logs = []

        for lr in learning_rates:
            for bs in batch_sizes:
                ### train model
                final_params, train_progress = train_logistic_regression_model(train_x, train_y, beta, lr, bs, max_epochs)
                all_params.append(final_params)
                all_train_logs.append((train_progress, "Learning rate: %f, Batch size: %d" % (lr, bs)))

                ### evaluate model on validation data
                valid_ll.append(calc_log_likelihood(valid_x, valid_y, final_params))
                valid_acc.append(calc_accuracy(valid_x, valid_y, final_params))

        ### Get best model
        best_model_idx = np.argmax(valid_acc)
        best_params = all_params[best_model_idx]
        test_ll = calc_log_likelihood(test_x, test_y, best_params)
        test_acc = calc_accuracy(test_x, test_y, best_params)
        print("Validation Accuracies: " + str(valid_acc))
        print("Test Accuracy: %f" % (test_acc))

        ### Plot
        plt.figure()
        epochs = range(max_epochs)
        for idx, log in enumerate(all_train_logs):
            plt.plot(epochs, log[0], linewidth=3, label="Training, "+log[1])
            # validation log likelihood drawn as a horizontal line
            plt.plot(epochs, max_epochs * [valid_ll[idx]], linewidth=5, label="Validation, "+log[1])

        # test log likelihood of the best model, drawn as a horizontal line
        plt.plot(epochs, max_epochs * [test_ll], ms=8, label="Testing, "+all_train_logs[best_model_idx][1])

        plt.xlabel(r"Epoch ($t$)")
        plt.ylabel("Log Likelihood")
        plt.ylim([-.8, 0.])
        plt.title("MNIST Results for Various Logistic Regression Models")
        plt.legend(loc=4)
        plt.show()

ADAM UPDATE

    def get_adam_update(alpha_0, grad, adam_values, b1=.95, b2=.999, e=1e-8):
        adam_values['t'] += 1

        # update mean (with bias correction)
        adam_values['mean'] = b1 * adam_values['mean'] + (1 - b1) * grad
        m_hat = adam_values['mean'] / (1 - b1 ** adam_values['t'])

        # update variance (with bias correction)
        adam_values['var'] = b2 * adam_values['var'] + (1 - b2) * grad ** 2
        v_hat = adam_values['var'] / (1 - b2 ** adam_values['t'])

        return alpha_0 * m_hat / (np.sqrt(v_hat) + e)

    # Initialize a dictionary that keeps track of the mean, variance, and update counter
    alpha_0 = 1e-3
    adam_values = {'mean': np.zeros(beta.shape), 'var': np.zeros(beta.shape), 't': 0}

    ### Inside the training loop do ###
    beta_grad = ...  # compute gradient w.r.t. the weight vector (beta) as usual
    beta_update = get_adam_update(alpha_0, beta_grad, adam_values)
    beta += beta_update
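The stabilization suggested for Newton-Raphson in Problem #3 (replace $H$ with $H + \epsilon I$, starting near $\epsilon = 0.001$ and increasing it until the inversion succeeds) could be sketched roughly as follows. The helper name `damped_inverse`, the growth factor of 10, and the retry limit are assumptions chosen for illustration; the toy matrix uses made-up data with a duplicated feature column so that $X^T A X$ is singular.

```python
import numpy as np

def damped_inverse(hessian, eps=1e-3, growth=10.0, max_tries=20):
    """Invert hessian + eps * I, enlarging eps until inversion succeeds.

    Returns the inverse together with the eps value that was used.
    """
    identity = np.eye(hessian.shape[0])
    for _ in range(max_tries):
        try:
            return np.linalg.inv(hessian + eps * identity), eps
        except np.linalg.LinAlgError:
            eps *= growth  # matrix was singular: strengthen the damping
    raise np.linalg.LinAlgError("no invertible damped matrix found")

# Toy data (made up): identical columns make X^T A X rank-deficient.
X = np.array([[1.0, 1.0],
              [2.0, 2.0],
              [3.0, 3.0]])
theta_hat = np.array([0.2, 0.5, 0.9])        # hypothetical predicted probabilities
A = np.diag(theta_hat * (1.0 - theta_hat))   # diagonal entries theta_i (1 - theta_i)
H = X.T @ A @ X
H_inv, eps_used = damped_inverse(H)
print(eps_used)
```

Note that `np.linalg.inv` raises an error only for exactly singular matrices; a nearly singular matrix may be inverted silently with very large entries, so checking the condition number with `np.linalg.cond` could be a useful extra safeguard. The pseudoinverse mentioned in the problem statement is available as `np.linalg.pinv`.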