Mustafa Jarrar: Lecture Notes on Linear Regression, Birzeit University, 2018. Machine Learning: Linear Regression.


Mustafa Jarrar: Lecture Notes on Linear Regression, Birzeit University, 2018, Version 1. Machine Learning: Linear Regression. Mustafa Jarrar, Birzeit University.

Watch this lecture and download the slides. Course page: http://www.jarrar.info/courses/ai/ More online courses at: http://www.jarrar.info Acknowledgement: This lecture is based on (but not limited to) Andrew Ng's course about Machine Learning: https://www.youtube.com/channel/ucmoxogx9mgrynewpciqucag

Mustafa Jarrar: Lecture Notes on Linear Regression, Birzeit University, 2018. Machine Learning: Linear Regression. In this lecture:
Part 1: Motivation (Regression Problems)
Part 2: Linear Regression
Part 3: The Cost Function
Part 4: The Gradient Descent Algorithm
Part 5: The Normal Equation
Part 6: Linear Algebra Overview
Part 7: Using Octave
Part 8: Using R

Motivation: Given a plot of housing prices (price in $1000s against size in ft²), how much does a house of 1250 ft² cost? We may assume a linear curve through the data, and so conclude that a house of 1250 ft² costs about $230K.

Motivation: Given the same housing prices (price in $1000s against size in ft²). Supervised learning: we are given the "right answer" for each example in the data (the training data). Regression problem: predict a real-valued output. Remember that classification (not regression) refers to predicting discrete-valued output.

Motivation: Given the following training set of housing prices, our job is to learn from this data how to predict prices.

Size in ft² (x)    Price in $1000s (y)
2104               460
1416               232
1534               315
852                178

Notation:
m = number of training examples
x's = input variables / features
y's = output variable / target variable
(x, y): a training example
(x^(i), y^(i)): the i-th training example
For example: x^(1) = 2104, y^(1) = 460

Part 2: Linear Regression

Linear Regression: The training set is fed to a learning algorithm, which outputs a hypothesis h. Given the size of a house x, the hypothesis h returns the estimated price y. How do we represent h?

h(x) = θ₀ + θ₁x

This is a linear function; h maps from x's to y's. This model is also called linear regression with one variable.

Linear Regression with Multiple Features: Linear regression with multiple features is also called multiple linear regression. Suppose we have the following features:

Size ft² (x₁)   Bedrooms (x₂)   Floors (x₃)   Age (x₄)   Price (y)
2104            5               1             45         460
1416            3               2             40         232
1534            3               2             30         315
852             2               1             36         178

A hypothesis function h(x) might be:
h(x) = θ₀ + θ₁x₁ + θ₂x₂ + θ₃x₃ + … + θₙxₙ
or, taking x₀ = 1:
h(x) = θ₀x₀ + θ₁x₁ + θ₂x₂ + θ₃x₃ + … + θₙxₙ
e.g., h(x) = 80 + 0.1x₁ + 0.01x₂ + 3x₃ − 2x₄
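
With x₀ = 1, this hypothesis is just the dot product of the parameter vector θ and the feature vector x. A minimal Octave sketch (an illustration, not part of the original slides), using the example θ values above and the first house in the table:

In Octave:
theta = [80; 0.1; 0.01; 3; -2];   % example parameters from the slide
x = [1; 2104; 5; 1; 45];          % first training example, with x0 = 1 prepended
h = theta' * x                    % h(x) = theta' * x, about 203 (in $1000s)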

Polynomial Regression: With the same features as above (size, bedrooms, floors, age, price), another hypothesis function h(x) might be polynomial:

h(x) = θ₀x₀ + θ₁x₁ + θ₂x₂² + θ₃x₃³ + … + θₙxₙⁿ   (x₀ = 1)

Polynomial Regression: Plotting price (y) against size (x), we may fit curves such as:

h(x) = θ₀ + θ₁x + θ₂x²
h(x) = θ₀ + θ₁x + θ₂√x
h(x) = θ₀ + θ₁x + θ₂x² + θ₃x³

We may also combine features (e.g., size = width × depth). We have the option of which features and which models (quadratic, cubic, …) to use. Deciding which features and models best fit our data and application is beyond the scope of this course, but there are several algorithms for this.
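
Note that these models are still linear in the parameters θ: we simply add powers of the size as extra feature columns and fit as usual. A minimal Octave sketch (illustrative, not from the slides) fitting a quadratic to the sizes and prices from the training table, using the normal equation introduced later in this lecture:

In Octave:
x = [852; 1416; 1534; 2104];       % sizes from the training set
y = [178; 232; 315; 460];          % prices from the training set
X = [ones(4,1), x, x.^2];          % design matrix for h(x) = theta0 + theta1*x + theta2*x^2
theta = pinv(X' * X) * X' * y;     % normal equation (see Part 5)
price = [1, 1250, 1250^2] * theta  % predicted price for a 1250 ft^2 house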

Part 3: The Cost Function

Understanding θ Values: In the hypothesis h(x) = θ₀ + θ₁x, the θᵢ's are the parameters of the model. How do we choose the θᵢ's?

Understanding θ Values: Three examples of how the parameters shape the hypothesis:

θ₀ = 1.5, θ₁ = 0:    h(x) = 1.5 + 0·x  (a horizontal line)
θ₀ = 0,   θ₁ = 0.5:  h(x) = 0 + 0.5x
θ₀ = 1,   θ₁ = 0.5:  h(x) = 1 + 0.5x

The Cost Function: Idea: choose θ₀, θ₁ so that h(x) is close to y for our training examples (x, y), where y is the actual value and h(x) is the estimated value. The cost function J measures the error we want to minimize over θ₀, θ₁:

J(θ₀, θ₁) = (1/2m) Σ_{i=1..m} (h(x^(i)) − y^(i))²

This cost function is also called a squared-error function.
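
To make the formula concrete, here is a short Octave sketch of the cost function for one-variable linear regression (an illustration, not from the slides); x and y are column vectors of inputs and targets:

In Octave:
function J = cost(theta0, theta1, x, y)
  m = length(y);
  h = theta0 + theta1 * x;          % predictions for all m training examples
  J = sum((h - y) .^ 2) / (2 * m);  % squared-error cost
end

For instance, cost(0, 0.5, [1; 2; 3], [1; 2; 3]) returns about 0.58, the J(0.5) value computed on the next slide.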

Understanding the Cost Function: Let's try different values of θ₁ (keeping θ₀ = 0, so h(x) = θ₁x) and see how the cost function J(θ₁) behaves on the training set (1,1), (2,2), (3,3):

J(1)   = 1/(2·3) [(0)² + (0)² + (0)²] = 0
J(0.5) = 1/(2·3) [(0.5−1)² + (1−2)² + (1.5−3)²] ≈ 0.58
J(0)   = 1/(2·3) [(1)² + (2)² + (3)²] ≈ 2.3

Given our training data, θ₁ = 1 is the best value, since it minimizes J(θ₁). (The slide plots h(x) for each choice of θ₁ alongside the resulting curve of J(θ₁).) But this figure varies only θ₁; the problem becomes more complicated once we also vary θ₀, i.e., J(θ₀, θ₁).
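
A tiny Octave sketch (illustrative) that reproduces these numbers by sweeping θ₁ over the toy training set:

In Octave:
x = [1; 2; 3];  y = [1; 2; 3];      % toy training set from the slide
for theta1 = [0, 0.5, 1]
  J = sum((theta1 * x - y) .^ 2) / (2 * length(y));
  printf("J(%.1f) = %.2f\n", theta1, J);
end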

Cost Function Intuition II: Remember that our goal is to find the values of θ₀ and θ₁ that minimize the cost.

Hypothesis:     h(x) = θ₀ + θ₁x
Parameters:     θ₀, θ₁
Cost function:  J(θ₀, θ₁) = (1/2m) Σ_{i=1..m} (h(x^(i)) − y^(i))²
Goal:           minimize J(θ₀, θ₁)

Cost Function Intuition II: Given a dataset, when we plot the cost function J(θ₀, θ₁) over θ₀ and θ₁ we get a bowl-shaped 3D surface. Our goal is to find the values of θ₀ and θ₁ at which J(θ₀, θ₁) is minimal.

Cost Function Intuition II: We may also draw the cost function using contour figures, where each contour connects values of (θ₀, θ₁) with the same cost J.

Cost Function Intuition II: For example, the hypothesis h(x) = 800 + 0.15x corresponds to one point (θ₀, θ₁) on the contour plot of J.

Cost Function Intuition II: Similarly, the hypothesis h(x) = 360 + 0·x corresponds to another point (θ₀, θ₁) on the contour plot of J.

Cost Function Intuition II: Is there a way/algorithm to find the best θ's automatically? Yes, e.g., the gradient descent algorithm.

Part 4: The Gradient Descent Algorithm

The Gradient Descent Algorithm: Start with some initial values of θ₀ and θ₁; keep changing θ₀ and θ₁ to reduce J(θ₀, θ₁); until we hopefully end up at a minimum.

Repeat until convergence (for j = 0 and j = 1):
  temp0 := θ₀ − α · ∂/∂θ₀ J(θ₀, θ₁)
  temp1 := θ₁ − α · ∂/∂θ₁ J(θ₀, θ₁)
  θ₀ := temp0
  θ₁ := temp1

Here α is the learning rate and ∂/∂θⱼ J(θ₀, θ₁) is the partial derivative of the cost with respect to θⱼ; the temporary variables ensure that θ₀ and θ₁ are updated simultaneously.

The Problem of the Gradient Descent Algorithm: If α is too small, gradient descent can be slow. If α is too large, gradient descent can overshoot the minimum; it may fail to converge, or even diverge. Also, depending on where it starts, it may converge to the global minimum or only to a local minimum.

The Gradient Descent Algorithm for Linear Regression: Plugging the linear-regression cost function into the update rule (for j = 0 and j = 1) gives the simplified version:

Repeat until convergence:
  θ₀ := θ₀ − α · (1/m) Σ_{i=1..m} (h(x^(i)) − y^(i))
  θ₁ := θ₁ − α · (1/m) Σ_{i=1..m} (h(x^(i)) − y^(i)) · x^(i)

Next, we will use linear algebra to minimize J(θ) directly (the so-called Normal Equation), without needing an iterative algorithm like gradient descent. However, gradient descent scales better for bigger datasets.
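
A compact Octave sketch of these updates (an illustration, not from the slides), written in vectorized form with x₀ = 1 so both parameters are updated simultaneously; alpha and num_iters are values you would tune:

In Octave:
function theta = gradient_descent(X, y, alpha, num_iters)
  % X is an m-by-2 matrix [1, x]; y is an m-by-1 vector of targets
  m = length(y);
  theta = zeros(2, 1);
  for iter = 1:num_iters
    h = X * theta;                                % predictions h(x) for all examples
    theta = theta - alpha * (X' * (h - y)) / m;   % simultaneous update of theta0 and theta1
  end
end

For example, gradient_descent([ones(3,1), [1;2;3]], [1;2;3], 0.1, 1000) converges to approximately θ = (0, 1) on the toy training set used earlier, while a much larger α (say 2) overshoots and diverges, illustrating the previous slide.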

Part 5: The Normal Equation

Minimizing!" numerically Another way to numerically (using linear algebra) estimate the optimal values of!". Given the following features: 1 2 3 4 y Size ft 2 bedrooms floors Age Price 2104 5 1 45 460 1416 3 2 40 232 1534 3 2 30 315 852 2 1 36 178 Convert the features and the target value into matries: X = 2104 5 1 45 1416 3 2 40 1534 3 2 30 852 2 1 36 y= 460 232 315 178 28

Minimizing!" numerically 1 2 3 4 y Size ft 2 bedrooms floors Age Price 2104 5 1 45 460 1416 3 2 40 232 1534 3 2 30 315 852 2 1 36 178 add # 0, so to represent! 0 X = 1 2104 5 1 45 1 1416 3 2 40 1 1534 3 2 30 1 852 2 1 36 y= 460 232 315 178 29

The Normal Equation: To obtain the optimal (cost-minimizing) values of the θ's, use the following normal equation:

θ = (XᵀX)⁻¹ Xᵀ y

where:
θ: the vector of θ's we want to find
X: the matrix of features in the training set
y: the output values we try to predict
Xᵀ: the transpose of X
(XᵀX)⁻¹: the inverse of the matrix XᵀX

In Octave: pinv(X'*X)*X'*y

Gradient Descent vs. the Normal Equation (m training examples, n features):

Gradient Descent: needs to choose the learning rate α; needs many iterations; works well even when n is large.

Normal Equation: no need to choose α; no need to iterate; but it needs to compute (XᵀX)⁻¹, which takes about O(n³) time, so it is slow if n is very large; also, some matrices (e.g., singular ones) are non-invertible.

Recommendation: use the normal equation if the number of features is less than about 1000; otherwise use gradient descent.

Part 6: Linear Algebra Overview

Matrix-Vector Multiplication:

[1 3]   [1]   [16]
[4 0] × [5] = [ 4]
[2 1]         [ 7]

1·1 + 3·5 = 16
4·1 + 0·5 = 4
2·1 + 1·5 = 7
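
The same computation in Octave (a quick check, not part of the slides):

In Octave:
A = [1 3; 4 0; 2 1];
v = [1; 5];
A * v        % matrix-vector product: [16; 4; 7]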

Matrix-Matrix Multiplication:

[1 3 2]   [1 3]   [11 10]
[4 0 1] × [0 1] = [ 9 14]
          [5 2]

Multiply with the first column:  [1 3 2; 4 0 1] × [1; 0; 5] = [11; 9]
Multiply with the second column: [1 3 2; 4 0 1] × [3; 1; 2] = [10; 14]

Matrix Inverse: If A is an m×m matrix and it has an inverse, then A·A⁻¹ = A⁻¹·A = I; multiplying a matrix by its inverse produces an identity matrix:

[3  4]   [ 0.4   -0.1  ]   [1 0]
[2 16] × [-0.05   0.075] = [0 1]

Some matrices do not have an inverse, e.g., a matrix whose cells are all zeros. For more, please review linear algebra.

Matrix Transpose: Example:

A = [1 2 0]    Aᵀ = [1 3]
    [3 5 9]         [2 5]
                    [0 9]

Let A be an m×n matrix, and let B = Aᵀ. Then B is an n×m matrix, and B_ij = A_ji.
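
These operations can be checked in Octave (a quick illustration using the matrices from the slides above):

In Octave:
A = [1 3 2; 4 0 1];
B = [1 3; 0 1; 5 2];
A * B            % matrix-matrix product: [11 10; 9 14]
C = [3 4; 2 16];
C * inv(C)       % identity matrix: [1 0; 0 1]
D = [1 2 0; 3 5 9];
D'               % transpose: [1 3; 2 5; 0 9]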

Part 7: Using Octave

Octave: Download from https://www.gnu.org/software/octave/ A high-level scientific programming language; a free alternative to (and compatible with) Matlab; it helps in solving linear and nonlinear problems numerically. Online tutorial: https://www.youtube.com/playlist?list=plnnr1o8owc6aajsc50lzzpvwgjzuczssd

Compute the Normal Equation Using Octave: Given

X = [1 2104 5 1 45      y = [460
     1 1416 3 2 40           232
     1 1534 3 2 30           315
     1  852 2 1 36]          178]

In Octave:
load X.txt
load y.txt
theta = pinv(X'*X)*X'*y
save theta.txt theta

θ = [188.4, 0.4, -56, -93, -3.7]

These are the values of the θ's to use in our hypothesis function h(x):
h(x) = θ₀ + θ₁x₁ + θ₂x₂ + θ₃x₃ + θ₄x₄
h(x) = 188.4 + 0.4x₁ − 56x₂ − 93x₃ − 3.7x₄
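
As a usage illustration (not on the original slide), once θ is computed we can estimate the price of a new house by plugging its features into h(x); the house below is a hypothetical example:

In Octave:
theta = [188.4; 0.4; -56; -93; -3.7];  % values computed above
x_new = [1; 1500; 3; 2; 20];           % hypothetical house: 1500 ft^2, 3 bedrooms, 2 floors, 20 years old
price = theta' * x_new                 % estimated price in $1000s, about 360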

Part 8: Using R

R (programming language): A free software environment for statistical computing and graphics, supported by the R Foundation for Statistical Computing. R comes with many functions that can do sophisticated things, and the ability to install additional packages to do much more. Download R: https://cran.r-project.org/ A very good IDE for R is RStudio: https://www.rstudio.com/products/rstudio/download/ R basics tutorial: https://www.youtube.com/playlist?list=pljgj6kdf_snybkiswqycytUZiDpam7ygg

Compute the Normal Equation Using R: Given the same X and y matrices as above, in R:

X = as.matrix(read.table("~/X.txt", header=F, sep=","))
Y = as.matrix(read.table("~/y.txt", header=F, sep=","))
thetas = solve( t(X) %*% X ) %*% t(X) %*% Y
write.table(thetas, file="~/thetas.txt", row.names=F, col.names=F)

t(X) is the transpose of matrix X; solve(X) is the inverse of matrix X.

References
[1] Andrew Ng's course about Machine Learning: https://www.youtube.com/channel/ucmoxogx9mgrynewpciqucag
[2] Sami Ghawi, Mustafa Jarrar: Lecture Notes on Introduction to Machine Learning, Birzeit University, 2018
[3] Mustafa Jarrar: Lecture Notes on Decision Tree Machine Learning, Birzeit University, 2018
[4] Mustafa Jarrar: Lecture Notes on Linear Regression Machine Learning, Birzeit University, 2018