Experiment 1: Linear Regression

August 27, 2018

1 Description

This first exercise will give you practice with linear regression. These exercises have been extensively tested with Matlab, but they should also work in Octave, which has been called a free version of Matlab. If you are using Octave, be sure to install the Image package as well (available for Windows as an option in the installer, and available for Linux from Octave-Forge).

2 Linear Regression

Recall that the linear regression model is

    h_\theta(x) = \theta^T x = \sum_{j=0}^{n} \theta_j x_j,    (1)

where θ is the parameter vector that we need to optimize and x is the (n+1)-dimensional feature vector. (A training example is actually n-dimensional, i.e., x = [x_1, x_2, ..., x_n]; for each training example we add an extra intercept item x_0 = 1, so the resulting feature vector is (n+1)-dimensional.) Given a training set {(x^(i), y^(i))}_{i=1,...,m}, our goal is to find the value of θ that minimizes the objective function J(θ) shown in Equation (2):

    J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2.    (2)

One optimization approach is the gradient descent algorithm. The algorithm runs iteratively, and in each iteration we update the parameter θ according to the following rule:

    \theta_j := \theta_j - \alpha \, \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)},    (3)

where α is the so-called learning rate, which lets us tune the convergence of gradient descent.
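As a minimal sketch of the update rule in Equation (3) in vectorized Matlab/Octave form (the variable names here are illustrative; the exercise data are introduced in the next section), assuming the training examples are stored row-wise in a matrix x that already includes the intercept column, and the targets in a column vector y:

    % Minimal sketch of one batch gradient descent step, Equation (3), in
    % vectorized form; x is m x (n+1) with an intercept column, y is m x 1.
    m = length(y);
    alpha = 0.07;                      % learning rate
    theta = zeros(size(x, 2), 1);      % initialize parameters to zero
    theta = theta - alpha * (1 / m) * x' * (x * theta - y);   % one update step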

3 2D Linear Regression

We start with a very simple case where n = 1. Download data1.zip, and extract the files (ex1x.dat and ex1y.dat) from the zip file. The files contain some example measurements of heights for various boys between the ages of two and eight. The y-values are the heights measured in meters, and the x-values are the ages of the boys corresponding to the heights. Each height and age tuple constitutes one training example (x^(i), y^(i)) in our dataset. There are m = 50 training examples, and you will use them to develop a linear regression model with the gradient descent algorithm, based on which we can predict the height given a new age value.

In Matlab/Octave, you can load the training set using the commands

    x = load('ex1x.dat');
    y = load('ex1y.dat');

This will be our training set for a supervised learning problem with n = 1 feature (in addition to the usual x_0 = 1, so x ∈ R^2). If you are using Matlab/Octave, run the following commands to plot your training set (and label the axes):

    figure                      % open a new figure window
    plot(x, y, 'o');
    ylabel('Height in meters')
    xlabel('Age in years')

You should see a series of data points similar to Fig. 1.

[Figure 1: Plotting the data. Height in meters (about 0.7 to 1.4) versus age in years (2 to 8).]

Before starting gradient descent, we need to add the x_0 = 1 intercept term to every example. To do this in Matlab/Octave, the commands are

    m = length(y);        % store the number of training examples
    x = [ones(m, 1), x];  % add a column of ones to x

From this point on, you will need to remember that the age values from your training data are actually in the second column of x. This will be important when plotting your results later.

We now implement linear regression for this problem. The linear regression model in this case is

    h_\theta(x) = \theta^T x = \sum_{i=0}^{1} \theta_i x_i = \theta_0 + \theta_1 x_1.    (4)

(1) Implement gradient descent using a learning rate of α = 0.07. Initialize the parameters to θ = 0 (i.e., θ_0 = θ_1 = 0), and run one iteration of gradient descent from this initial starting point. Record the values of θ_0 and θ_1 that you get after this first iteration.

(2) Continue running gradient descent for more iterations until θ converges (this will take a total of about 1500 iterations). After convergence, record the final values of θ_0 and θ_1, and plot the straight-line fit from your algorithm on the same graph as your training data. The plotting commands will look something like this:

    hold on                         % plot new data without clearing old plot
    plot(x(:, 2), x * theta, '-')   % remember that x is now a matrix with 2 columns
                                    % and the second column contains the age info
    legend('Training data', 'Linear regression')

Note that for most machine learning problems, x is very high-dimensional, so we will not be able to plot h_θ(x). But since this example has only one feature, plotting the fit gives a nice sanity check on our result.

(3) Finally, we'd like to make some predictions using the learned hypothesis. Use your model to predict the height for two boys of ages 3.5 and 7.

4 Understanding J(θ)

We'd like to understand better what gradient descent has done, and to visualize the relationship between the parameters θ ∈ R^2 and J(θ). In this problem, we'll plot J(θ) as a 3D surface plot. (When applying learning algorithms, we don't usually try to plot J(θ), since θ ∈ R^n is usually so high-dimensional that we have no simple way to plot or visualize J(θ). But because the example here uses a very low-dimensional θ ∈ R^2, we'll plot J(θ) to gain more intuition about linear regression.) To get the best viewing results on your surface plot, use the range of theta values suggested in the code skeleton below.

    J_vals = zeros(100, 100);              % initialize J_vals to a 100 x 100 matrix of 0's
    theta0_vals = linspace(-3, 3, 100);
    theta1_vals = linspace(-1, 1, 100);
    for i = 1:length(theta0_vals)
        for j = 1:length(theta1_vals)
            t = [theta0_vals(i); theta1_vals(j)];
            J_vals(i, j) = %% YOUR CODE HERE %%
        end
    end
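One possible way (a sketch, not necessarily the intended solution) to fill in the cost computation in the skeleton above, following Equation (2) and using the x (with intercept column), y, and m loaded in Section 3:

    % Hedged fill-in for the J_vals skeleton, following Equation (2);
    % assumes x (with the intercept column), y, and m from Section 3.
    J_vals(i, j) = (1 / (2 * m)) * sum((x * t - y) .^ 2);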

    % Plot the surface plot
    % Because of the way meshgrids work in the surf command, we need to
    % transpose J_vals before calling surf, or else the axes will be flipped
    J_vals = J_vals';
    figure;
    surf(theta0_vals, theta1_vals, J_vals)
    xlabel('\theta_0'); ylabel('\theta_1')

You should get a figure similar to Fig. 2.

[Figure 2: The relationship between J and θ (a 3D surface plot of J(θ) over the grid of θ_0 and θ_1 values).]

If you are using Matlab/Octave, you can use the orbit tool to view this plot from different viewpoints. What is the relationship between this 3D surface and the values of θ_0 and θ_1 that your implementation of gradient descent found? Visualize the relationship with both the surf and contour commands.

Remarks:

- For the surf function surf(x, y, z), if x and y are vectors, then x = 1:columns(z) and y = 1:rows(z). Therefore, z(i, j) is actually calculated from x(j) and y(i). This rule also applies to the contour function.

- We can specify the number and the distribution of contour levels in the contour function by passing a vector of levels, e.g., a linearly spaced vector (linspace) or a logarithmically spaced vector (logspace). Try both in this exercise and select the one that gives the clearer illustration; a sketch is given below.
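As a sketch of the contour suggestion in the remarks above, assuming the transposed J_vals, theta0_vals, and theta1_vals from the skeleton; the level vector logspace(-2, 2, 15) is only an illustrative choice and may need adjusting for your cost values:

    % Hedged sketch: contour plot of J(theta) with logarithmically spaced levels.
    % Assumes theta0_vals, theta1_vals, and the transposed J_vals from above;
    % logspace(-2, 2, 15) is an illustrative choice of levels.
    figure;
    contour(theta0_vals, theta1_vals, J_vals, logspace(-2, 2, 15))
    xlabel('\theta_0'); ylabel('\theta_1')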

5 Multivariate Linear Regression

We now look at a more complicated case where each training example contains multiple features. Download data1.zip, and extract the files (ex2x.dat and ex2y.dat) from the zip file. This is a training set of housing prices in Portland, Oregon, where the outputs y are the prices and the inputs x are the living area and the number of bedrooms. There are m = 47 training examples. Take a look at the values of the inputs x^(i) and note that the living areas are about 1000 times the number of bedrooms. This difference means that preprocessing the inputs will significantly improve the efficiency of gradient descent. In your program, scale both types of inputs by their standard deviations and set their means to zero. In Matlab/Octave, this can be done with

    sigma = std(x);
    mu = mean(x);
    x(:, 2) = (x(:, 2) - mu(2)) ./ sigma(2);
    x(:, 3) = (x(:, 3) - mu(3)) ./ sigma(3);

5.1 Selecting A Learning Rate Using J(θ)

Now it's time to select a learning rate α. The goal of this part is to pick a good learning rate in the range 0.001 ≤ α ≤ 10. You will do this by making an initial selection, running gradient descent, observing the cost function, and adjusting the learning rate accordingly. Recall that the cost function is defined in Equation (2). It can also be written in the following vectorized form:

    J(\theta) = \frac{1}{2m} (X\theta - y)^T (X\theta - y),

where

    y = \begin{bmatrix} y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(m)} \end{bmatrix}, \qquad
    X = \begin{bmatrix} (x^{(1)})^T \\ (x^{(2)})^T \\ \vdots \\ (x^{(m)})^T \end{bmatrix}.

The vectorized version is useful and efficient when you're working with numerical computing tools like Matlab/Octave. If you are familiar with matrices, you can prove to yourself that the two forms are equivalent.

While in the previous exercise you calculated J(θ) over a grid of θ_0 and θ_1 values, you will now calculate J(θ) using the θ of the current stage of gradient descent. After stepping through many stages, you will see how J(θ) changes as the iterations advance.

Now, run gradient descent for about 50 iterations at your initial learning rate. In each iteration, calculate J(θ) and store the result in a vector J. After the last iteration, plot the J values against the iteration number. In Matlab/Octave, the steps would look something like this (a possible fill-in for the placeholders is sketched after the skeleton):

    theta = zeros(size(x(1, :)))';   % initialize fitting parameters
    alpha = %% Your initial learning rate %%
    J = zeros(50, 1);
    for num_iterations = 1:50
        J(num_iterations) = %% Calculate your cost function here %%
        theta = %% Result of gradient descent update %%
    end
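One possible way to fill in the two placeholders in the loop above, using the vectorized forms of Equations (2) and (3); this is a sketch, not the only valid implementation, and it assumes the scaled x (with its intercept column), y, and m = length(y) are already in the workspace:

    % Hedged fill-in for the skeleton above, using vectorized Equations (2) and (3);
    % assumes the scaled x (with intercept column), y, and m = length(y) are loaded.
    J(num_iterations) = (1 / (2 * m)) * (x * theta - y)' * (x * theta - y);
    theta = theta - alpha * (1 / m) * x' * (x * theta - y);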

    % now plot J
    % technically, the first J starts at the zeroth iteration,
    % but Matlab/Octave doesn't have a zero index
    figure;
    plot(0:49, J(1:50), '-')
    xlabel('Number of iterations')
    ylabel('Cost J')

If you picked a learning rate within a good range, your plot should appear like the figure below.

[Figure: Cost J (vertical axis) versus number of iterations (0 to 50), decreasing steadily toward convergence.]

If your graph looks very different, especially if your value of J(θ) increases or even blows up, adjust your learning rate and try again. We recommend testing alphas at a rate of 3 times the next smallest value (i.e., 0.01, 0.03, 0.1, 0.3, and so on). You may also want to adjust the number of iterations you are running if that will help you see the overall trend in the curve.

To compare how different learning rates affect convergence, it's helpful to plot J for several learning rates on the same graph. In Matlab/Octave, this can be done by performing gradient descent multiple times with a hold on command between plots. Concretely, if you've tried three different values of alpha (you should probably try more values than this) and stored the costs in J1, J2 and J3, you can use the following commands to plot them on the same figure:

    plot(0:49, J1(1:50), 'b-');
    hold on;
    plot(0:49, J2(1:50), 'r-');
    plot(0:49, J3(1:50), 'k-');

The final arguments 'b-', 'r-', and 'k-' specify different plot styles for the plots. Type help plot at the Matlab/Octave command line for more information on plot styles.

Answer the following questions:

1. Observe how the cost function changes as the learning rate changes. What happens when the learning rate is too small? Too large?

2. Using the best learning rate that you found, run gradient descent until convergence to find

(a) the final values of θ;

(b) the predicted price of a house with 1650 square feet and 3 bedrooms. Don't forget to scale your features when you make this prediction!
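A minimal sketch of the prediction in question 2(b), assuming the converged theta and the mu and sigma computed above are still in the workspace (the feature order, living area then bedrooms, follows the data files):

    % Hedged sketch of the prediction in 2(b): scale the new inputs with the
    % same mu and sigma used on the training set, then apply the hypothesis.
    price = [1, (1650 - mu(2)) / sigma(2), (3 - mu(3)) / sigma(3)] * theta;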