10-601 Machine Learning: Homework 4
Due 5 p.m. Monday, February 16, 2015

Instructions

Late homework policy: Homework is worth full credit if submitted before the due date, half credit during the next 48 hours, and zero credit after that. You must turn in at least n-1 of the n homeworks to pass the class, even if for zero credit.

Collaboration policy: Homeworks must be done individually, except where otherwise noted in the assignments. "Individually" means each student must hand in their own answers, and each student must write and use their own code in the programming parts of the assignment. It is acceptable for students to collaborate in figuring out answers and to help each other solve the problems, though you must in the end write up your own solutions individually, and you must list the names of the students you discussed this with. We will be assuming that, as participants in a graduate course, you will be taking the responsibility to make sure you personally understand the solution to any work arising from such collaboration.

Online submission: You must submit your solutions online on Autolab. We recommend that you use LaTeX to type your solutions to the written questions, but we will accept scanned solutions as well. On the Homework 4 Autolab page, you can download the template, which is a tar archive containing a blank placeholder PDF for the written questions and one Octave source file for each of the programming questions. Replace the PDF file with one that contains your solutions to the written questions, and fill in each of the Octave source files with your code. When you are ready to submit, create a new tar archive of the top-level directory and submit your archived solutions online by clicking the "Submit File" button. You should submit a single tar archive identical to the template, except with each of the Octave source files filled in and with the blank PDF replaced by your solutions for the written questions. You are free to submit as many times as you like (which is useful since you can see the autograder feedback immediately). DO NOT change the name of any of the files or folders in the submission template. In other words, your submitted files should have exactly the same names as those in the submission template. Do not modify the directory structure.

Problem 1: Gradient Descent

In class we derived a gradient descent learning algorithm for simple linear regression, where we assumed

    y = w0 + w1 x1 + ε,  where ε ~ N(0, σ²).

In this question, you will do the same for the following model:

    y = w0 + w1 x1 + w2 x2 + w3 x1² + ε,  where ε ~ N(0, σ²),

where your learning algorithm will estimate the parameters w0, w1, w2, and w3.

a) [8 Points] Write down an expression for P(y | x1, x2).

Solution: Since y | x1, x2 ~ N(w0 + w1 x1 + w2 x2 + w3 x1², σ²), we have

    P(y | x1, x2) = (1 / (√(2π) σ)) exp{ -(y - w0 - w1 x1 - w2 x2 - w3 x1²)² / (2σ²) }    (1)
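To make the density in part a) concrete, here is a small Octave sketch that evaluates P(y | x1, x2) for the quadratic model. The function name cond_density and the argument sigma2 (the noise variance σ²) are my own choices for illustration; they are not part of the assignment template.

    % Hypothetical helper (not part of the homework template): evaluate the
    % conditional density P(y | x1, x2) for the model of Problem 1.
    % w = [w0; w1; w2; w3], sigma2 = sigma^2.
    function p = cond_density(y, x1, x2, w, sigma2)
      mu = w(1) + w(2)*x1 + w(3)*x2 + w(4)*x1^2;        % conditional mean of y
      p  = exp(-(y - mu)^2 / (2*sigma2)) / sqrt(2*pi*sigma2);
    end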

b) [8 Points] Assume you are given a set of training observations (x1^(i), x2^(i), y^(i)) for i = 1, ..., n. Write down the conditional log likelihood of this training data. Drop any constants that do not depend on the parameters w0, ..., w3.

Solution: The conditional log likelihood is given by

    l(w0, w1, w2, w3) = log ∏_{i=1}^n P(y^(i) | x1^(i), x2^(i))
                      = Σ_{i=1}^n log P(y^(i) | x1^(i), x2^(i))
                      = Σ_{i=1}^n log [ (1 / (√(2π) σ)) exp{ -(y^(i) - w0 - w1 x1^(i) - w2 x2^(i) - w3 (x1^(i))²)² / (2σ²) } ]
                      = Σ_{i=1}^n [ log(1 / (√(2π) σ)) - (1 / (2σ²)) (y^(i) - w0 - w1 x1^(i) - w2 x2^(i) - w3 (x1^(i))²)² ].

Dropping the additive and positive multiplicative constants that do not depend on w0, ..., w3, it suffices to maximize

    -Σ_{i=1}^n (y^(i) - w0 - w1 x1^(i) - w2 x2^(i) - w3 (x1^(i))²)².

c) [8 Points] Based on your answer above, write down a function f(w0, w1, w2, w3) that can be minimized to find the desired parameter estimates.

Solution: Since we want to maximize the conditional log likelihood, this is equivalent to minimizing the negative conditional log likelihood, which (after dropping constants) is given by

    f(w0, w1, w2, w3) = -l(w0, w1, w2, w3) = Σ_{i=1}^n (y^(i) - w0 - w1 x1^(i) - w2 x2^(i) - w3 (x1^(i))²)².

d) [8 Points] Calculate the gradient of f(w) with respect to the parameter vector w = [w0, w1, w2, w3]^T. Hint: the gradient can be written as

    ∇_w f(w) = [ ∂f(w)/∂w0, ∂f(w)/∂w1, ∂f(w)/∂w2, ∂f(w)/∂w3 ]^T.

Solution: The elements of the gradient of f(w) are given by

    ∂f(w)/∂w0 = -2 Σ_{i=1}^n (y^(i) - w0 - w1 x1^(i) - w2 x2^(i) - w3 (x1^(i))²)
    ∂f(w)/∂w1 = -2 Σ_{i=1}^n (y^(i) - w0 - w1 x1^(i) - w2 x2^(i) - w3 (x1^(i))²) x1^(i)
    ∂f(w)/∂w2 = -2 Σ_{i=1}^n (y^(i) - w0 - w1 x1^(i) - w2 x2^(i) - w3 (x1^(i))²) x2^(i)
    ∂f(w)/∂w3 = -2 Σ_{i=1}^n (y^(i) - w0 - w1 x1^(i) - w2 x2^(i) - w3 (x1^(i))²) (x1^(i))².
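As an illustration of parts c) and d), the following Octave sketch evaluates f(w) and its gradient, vectorized over the n training points. The function name quad_obj_and_grad and the variable names X1, X2, y are hypothetical and are not part of the assignment template.

    % Hypothetical sketch: f(w) from part c) and its gradient from part d).
    % X1, X2, y are n-by-1 vectors; w = [w0; w1; w2; w3].
    function [f, g] = quad_obj_and_grad(X1, X2, y, w)
      r = y - (w(1) + w(2)*X1 + w(3)*X2 + w(4)*X1.^2);    % residuals y^(i) - prediction
      f = sum(r.^2);                                      % sum of squared residuals
      g = -2 * [sum(r); sum(r .* X1); sum(r .* X2); sum(r .* X1.^2)];  % [df/dw0; df/dw1; df/dw2; df/dw3]
    end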

e) [8 Points] Write down a gradient descent update rule for w in terms of ∇_w f(w).

Solution: The gradient descent update rule is w := w - η ∇_w f(w), where η is the step size.

Problem 2: Logistic Regression

In this question, you will implement a logistic regression classifier and apply it to a two-class classification problem. As in the last assignment, your code will be autograded by a series of test scripts that run on Autolab. To get started, download the template for Homework 4 from the Autolab website, and extract the contents of the compressed directory. In the archive, you will find one .m file for each of the functions that you are asked to implement, along with a file called HW4Data.mat that contains the data for this problem. You can load the data into Octave by executing load("HW4Data.mat") in the Octave interpreter. Make sure not to modify any of the function headers that are provided.

a) [5 Points] In logistic regression, our goal is to learn a set of parameters by maximizing the conditional log likelihood of the data. Assuming you are given a dataset with n training examples and p features, write down a formula for the conditional log likelihood of the training data in terms of the class labels y^(i), the features x1^(i), ..., xp^(i), and the parameters w0, w1, ..., wp, where the superscript (i) denotes the sample index. This will be your objective function for gradient ascent.

Solution: From the lecture slides, we know that the logistic regression conditional log likelihood is given by:

    l(w0, w1, ..., wp) = log ∏_{i=1}^n P(y^(i) | x1^(i), ..., xp^(i); w0, w1, ..., wp)
                       = Σ_{i=1}^n [ y^(i) (w0 + Σ_{j=1}^p wj xj^(i)) - log(1 + exp{ w0 + Σ_{j=1}^p wj xj^(i) }) ].

b) [5 Points] Compute the partial derivative of the objective function with respect to w0 and with respect to an arbitrary wj, i.e. derive ∂f/∂w0 and ∂f/∂wj, where f is the objective that you provided above.

Solution: The partial derivatives are given by:

    ∂f/∂w0 = Σ_{i=1}^n [ y^(i) - exp{ w0 + Σ_{j=1}^p wj xj^(i) } / (1 + exp{ w0 + Σ_{j=1}^p wj xj^(i) }) ]
    ∂f/∂wj = Σ_{i=1}^n xj^(i) [ y^(i) - exp{ w0 + Σ_{j'=1}^p wj' xj'^(i) } / (1 + exp{ w0 + Σ_{j'=1}^p wj' xj'^(i) }) ].

c) [30 Points] Implement a logistic regression classifier using gradient ascent by filling in the missing code for the following functions:

    Calculate the value of the objective function: obj = LR_CalcObj(XTrain, yTrain, wHat)
    Calculate the gradient: grad = LR_CalcGrad(XTrain, yTrain, wHat)
    Update the parameter value: wHat = LR_UpdateParams(wHat, grad, eta)
    Check whether gradient ascent has converged: hasConverged = LR_CheckConvg(oldObj, newObj, tol)
    Complete the implementation of gradient ascent: [wHat, objVals] = LR_GradientAscent(XTrain, yTrain)
    Predict the labels for a set of test examples: [yHat, numErrors] = LR_PredictLabels(XTest, yTest, wHat)

where the arguments and return values of each function are defined as follows:

    XTrain is an n × p dimensional matrix that contains one training instance per row
    yTrain is an n × 1 dimensional vector containing the class labels for each training instance
    wHat is a (p + 1) × 1 dimensional vector containing the regression parameter estimates ŵ0, ŵ1, ..., ŵp
    grad is a (p + 1) × 1 dimensional vector containing the value of the gradient of the objective function with respect to each parameter in wHat
    eta is the gradient ascent step size, which you should set to eta = 0.01
    obj, oldObj and newObj are values of the objective function
    tol is the convergence tolerance, which you should set to tol = 0.001
    objVals is a vector containing the objective value at each iteration of gradient ascent
    XTest is an m × p dimensional matrix that contains one test instance per row
    yTest is an m × 1 dimensional vector containing the true class labels for each test instance
    yHat is an m × 1 dimensional vector containing your predicted class labels for each test instance
    numErrors is the number of misclassified examples, i.e. the number of differences between yHat and yTest

To complete the LR_GradientAscent function, you should use the helper functions LR_CalcObj, LR_CalcGrad, LR_UpdateParams, and LR_CheckConvg.

Solution: See the functions LR_CalcObj, LR_CalcGrad, LR_UpdateParams, LR_CheckConvg, LR_GradientAscent, and LR_PredictLabels in the solution code.
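For concreteness, here is one way the first two helper functions from part c) could be written in Octave. This is only an illustrative sketch, not the released solution code; it assumes labels y^(i) in {0, 1} and the wHat layout defined above.

    % Illustrative sketch only -- not the released solution code.
    % Assumes yTrain contains 0/1 labels and wHat = [w0; w1; ...; wp].
    function obj = LR_CalcObj(XTrain, yTrain, wHat)
      z   = wHat(1) + XTrain * wHat(2:end);      % w0 + sum_j wj*xj^(i) for every i (n-by-1)
      obj = sum(yTrain .* z - log(1 + exp(z)));  % conditional log likelihood from part a)
    end

    function grad = LR_CalcGrad(XTrain, yTrain, wHat)
      z    = wHat(1) + XTrain * wHat(2:end);
      p    = 1 ./ (1 + exp(-z));                 % P(y = 1 | x; w)
      err  = yTrain - p;                         % y^(i) - P(y = 1 | x^(i); w)
      grad = [sum(err); XTrain' * err];          % partial derivatives from part b)
    end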

d) [0 Points] Train your logistic regression classifier on the data provided in XTrain and yTrain with LR_GradientAscent, and then use your estimated parameters wHat to calculate predicted labels for the data in XTest with LR_PredictLabels.

Solution: See the function RunLR in the solution code.

e) [5 Points] Report the number of misclassified examples in the test set.

Solution: There are 13 misclassified examples in the test set.

f) [5 Points] Plot the value of the objective function on each iteration of gradient ascent, with the iteration number on the horizontal axis and the objective value on the vertical axis. Make sure to include axis labels and a title for your plot. Report the number of iterations that are required for the algorithm to converge.

Solution: See Figure 1. The algorithm converges after 87 iterations.

g) [10 Points] Next, you will evaluate how the training and test error change as the training set size increases. For each value of k in the set {10, 20, 30, ..., 480, 490, 500}, first choose a random subset of the training data of size k using the following code:

    subsetInds = randperm(n, k)
    XTrainSubset = XTrain(subsetInds, :)
    yTrainSubset = yTrain(subsetInds)

Then re-train your classifier using XTrainSubset and yTrainSubset, and use the estimated parameters to calculate the number of misclassified examples on both the training set (XTrainSubset and yTrainSubset) and on the original test set (XTest and yTest). Finally, generate a plot with two lines: in blue, plot the value of the training error against k, and in red, plot the value of the test error against k, where the error should be on the vertical axis and the training set size should be on the horizontal axis. Make sure to include a legend in your plot to label the two lines. Describe what happens to the training and test error as the training set size increases, and provide an explanation for why this behavior occurs.

Solution: See Figure 2. As the training set size increases, test error decreases but training error increases. This pattern becomes even more evident when we perform the same experiment using multiple random subsamples for each training set size and calculate the average training and test error over these samples, the result of which is shown in Figure 3.
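The following Octave sketch shows one way to run the part g) experiment around the functions defined above. The loop structure, the variable names ks, trainErr, and testErr, and the plot labels are my own choices rather than the released solution code.

    % Illustrative sketch of the part g) learning-curve experiment (not the released code).
    ks       = 10:10:500;
    trainErr = zeros(size(ks));
    testErr  = zeros(size(ks));
    n        = size(XTrain, 1);
    for t = 1:numel(ks)
      k            = ks(t);
      subsetInds   = randperm(n, k);                 % random subset of size k, as in the handout
      XTrainSubset = XTrain(subsetInds, :);
      yTrainSubset = yTrain(subsetInds);
      [wHat, objVals]  = LR_GradientAscent(XTrainSubset, yTrainSubset);
      [yHatTr, nErrTr] = LR_PredictLabels(XTrainSubset, yTrainSubset, wHat);
      [yHatTe, nErrTe] = LR_PredictLabels(XTest, yTest, wHat);
      trainErr(t) = nErrTr;
      testErr(t)  = nErrTe;
    end
    plot(ks, trainErr, 'b', ks, testErr, 'r');
    xlabel('Training set size k'); ylabel('Number of misclassified examples');
    legend('Training error', 'Test error'); title('Training and test error vs. training set size');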

Figure 1: Solution to Problem 2f).

When the training set size is small, the logistic regression model is often capable of perfectly classifying the training data, since it has relatively little variation. This is why the training error is close to zero. However, such a model has poor generalization ability because its estimate of wHat is based on a sample that is not representative of the true population from which the data is drawn. This phenomenon is known as overfitting, because the model fits too closely to the training data. As the training set size increases, more variation is introduced into the training data, and the model is usually no longer able to fit the training set as well. This is also due to the fact that the complete dataset is not 100% linearly separable. At the same time, more training data provides the model with a more complete picture of the overall population, which allows it to learn a more accurate estimate of wHat. This in turn leads to better generalization ability, i.e. lower prediction error on the test dataset.

Extra Credit:

h) [5 Points] Based on the logistic regression formula you learned in class, derive the analytical expression for the decision boundary of the classifier in terms of w0, w1, ..., wp and x1, ..., xp. What can you say about the shape of the decision boundary?

Solution: The analytical formula for the decision boundary is given by w0 + Σ_{j=1}^p wj xj = 0. This is the equation of a hyperplane in R^p, which indicates that the decision boundary is linear.

i) [5 Points] In this part, you will plot the decision boundary produced by your classifier. First, create a two-dimensional scatter plot of your test data by choosing the two features that have the highest absolute weight in your estimated parameters wHat (let's call them features j and k), and plotting the j-th dimension stored in XTest(:,j) on the horizontal axis and the k-th dimension stored in XTest(:,k) on the vertical axis. Color each point on the plot so that examples with true label y = 1 are shown in blue and examples with label y = 0 are shown in red. Next, using the formula that you derived in part h), plot the decision boundary of your classifier in black on the same figure, again considering only dimensions j and k.

Solution: See the function PlotDB in the solution code. See Figure 4.
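A possible Octave sketch for the part i) plot is given below. It restricts the boundary from part h) to the two chosen dimensions, i.e. it draws the line w0 + wj*xj + wk*xk = 0. This is an illustration only, not the released PlotDB function.

    % Illustrative sketch for part i) -- not the released PlotDB function.
    % wHat = [w0; w1; ...; wp]; the weight of feature j is wHat(1 + j).
    [wSorted, order] = sort(abs(wHat(2:end)), 'descend');
    j = order(1);  k = order(2);                       % two features with largest |weight|
    pos = (yTest == 1);  neg = (yTest == 0);
    plot(XTest(pos, j), XTest(pos, k), 'bo');  hold on;
    plot(XTest(neg, j), XTest(neg, k), 'ro');
    xs = linspace(min(XTest(:, j)), max(XTest(:, j)), 100);
    ys = -(wHat(1) + wHat(1 + j) * xs) / wHat(1 + k);  % solve w0 + wj*xj + wk*xk = 0 for xk
    plot(xs, ys, 'k');
    xlabel(sprintf('Feature %d', j));  ylabel(sprintf('Feature %d', k));
    title('Test data and decision boundary (dimensions j and k)');
    hold off;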

Figure 2: Solution to Problem 2g).

Figure 3: Additional solution to Problem 2g).

Figure 4: Solution to Problem 2i).