Mustafa Jarrar: Lecture Notes on Linear Regression, Birzeit University, 2018. Machine Learning: Linear Regression.


Mustafa Jarrar: Lecture Notes on Linear Regression, Birzeit University, 2018, Version 1. Machine Learning: Linear Regression. Mustafa Jarrar, Birzeit University.

Watch this lecture and download the slides. Course page: http://www.jarrar.info/courses/ai/ More online courses at: http://www.jarrar.info Acknowledgement: This lecture is based on (but not limited to) Andrew Ng's course about Machine Learning: https://www.youtube.com/channel/ucmoxogx9mgrynewpciqucag

Mustafa Jarrar: Lecture Notes on Linear Regression, Birzeit University, 2018. Machine Learning: Linear Regression. In this lecture:
Part 1: Motivation (Regression Problems)
Part 2: Linear Regression
Part 3: The Cost Function
Part 4: The Gradient Descent Algorithm
Part 5: The Normal Equation
Part 6: Linear Algebra Overview
Part 7: Using Octave
Part 8: Using R

Motivation: Given a plot of housing prices (price in $1000s against size in ft²), how much does a house of 1250 ft² cost? We may assume a linear curve through the data, and so conclude that a house of 1250 ft² costs about $230K.

Motivation: Given the same housing prices (price in $1000s against size in ft²). Supervised learning: we are given the "right answer" for each example in the data (the training data). Regression problem: predict a real-valued output. Remember that classification (not regression) refers to predicting discrete-valued output.

Motivation: Given the following training set of housing prices, our job is to learn from this data how to predict prices.

Size in ft² (x)    Price in $1000s (y)
2104               460
1416               232
1534               315
852                178

Notation:
m = number of training examples
x's = input variables / features
y's = output variable / target variable
(x, y): a training example
(x^(i), y^(i)): the i-th training example
For example: x^(1) = 2104, y^(1) = 460

Part 2: Linear Regression

Linear Regression: The training set is fed to a learning algorithm, which outputs a hypothesis h. Given the size of a house x, the hypothesis h returns the estimated price y. How do we represent h?

h(x) = θ₀ + θ₁x

This is a linear function; h maps from x's to y's. This model is also called linear regression with one variable.

Linear Regression with Multiple Features: Linear regression with multiple features is also called multiple linear regression. Suppose we have the following features:

Size ft² (x₁)   Bedrooms (x₂)   Floors (x₃)   Age (x₄)   Price (y)
2104            5               1             45         460
1416            3               2             40         232
1534            3               2             30         315
852             2               1             36         178

A hypothesis function h(x) might be:
h(x) = θ₀ + θ₁x₁ + θ₂x₂ + θ₃x₃ + … + θₙxₙ
or, taking x₀ = 1:
h(x) = θ₀x₀ + θ₁x₁ + θ₂x₂ + θ₃x₃ + … + θₙxₙ
e.g., h(x) = 80 + 0.1x₁ + 0.01x₂ + 3x₃ − 2x₄
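
With x₀ = 1, this hypothesis is just the dot product of the parameter vector θ and the feature vector x. A minimal Octave sketch (an illustration, not part of the original slides), using the example θ values above and the first house in the table:

In Octave:
theta = [80; 0.1; 0.01; 3; -2];   % example parameters from the slide
x = [1; 2104; 5; 1; 45];          % first training example, with x0 = 1 prepended
h = theta' * x                    % h(x) = theta' * x, about 203 (in $1000s)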

Polynomial Regression: With the same features as above (size, bedrooms, floors, age, price), another hypothesis function h(x) might be polynomial:

h(x) = θ₀x₀ + θ₁x₁ + θ₂x₂² + θ₃x₃³ + … + θₙxₙⁿ   (x₀ = 1)

Polynomial Regression: Plotting price (y) against size (x), we may fit curves such as:

h(x) = θ₀ + θ₁x + θ₂x²
h(x) = θ₀ + θ₁x + θ₂√x
h(x) = θ₀ + θ₁x + θ₂x² + θ₃x³

We may also combine features (e.g., size = width × depth). We have the option of which features and which models (quadratic, cubic, …) to use. Deciding which features and models best fit our data and application is beyond the scope of this course, but there are several algorithms for this.
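
Note that these models are still linear in the parameters θ: we simply add powers of the size as extra feature columns and fit as usual. A minimal Octave sketch (illustrative, not from the slides) fitting a quadratic to the sizes and prices from the training table, using the normal equation introduced later in this lecture:

In Octave:
x = [852; 1416; 1534; 2104];       % sizes from the training set
y = [178; 232; 315; 460];          % prices from the training set
X = [ones(4,1), x, x.^2];          % design matrix for h(x) = theta0 + theta1*x + theta2*x^2
theta = pinv(X' * X) * X' * y;     % normal equation (see Part 5)
price = [1, 1250, 1250^2] * theta  % predicted price for a 1250 ft^2 house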

Part 3: The Cost Function

Understanding θ Values: In the hypothesis h(x) = θ₀ + θ₁x, the θᵢ's are the parameters of the model. How do we choose the θᵢ's?

Understanding θ Values: Three examples of how the parameters shape the hypothesis:

θ₀ = 1.5, θ₁ = 0:    h(x) = 1.5 + 0·x  (a horizontal line)
θ₀ = 0,   θ₁ = 0.5:  h(x) = 0 + 0.5x
θ₀ = 1,   θ₁ = 0.5:  h(x) = 1 + 0.5x

The Cost Function: Idea: choose θ₀, θ₁ so that h(x) is close to y for our training examples (x, y), where y is the actual value and h(x) is the estimated value. The cost function J measures the error we want to minimize over θ₀, θ₁:

J(θ₀, θ₁) = (1/2m) Σ_{i=1..m} (h(x^(i)) − y^(i))²

This cost function is also called a squared-error function.
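
To make the formula concrete, here is a short Octave sketch of the cost function for one-variable linear regression (an illustration, not from the slides); x and y are column vectors of inputs and targets:

In Octave:
function J = cost(theta0, theta1, x, y)
  m = length(y);
  h = theta0 + theta1 * x;          % predictions for all m training examples
  J = sum((h - y) .^ 2) / (2 * m);  % squared-error cost
end

For instance, cost(0, 0.5, [1; 2; 3], [1; 2; 3]) returns about 0.58, the J(0.5) value computed on the next slide.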

Understanding the Cost Function: Let's try different values of θ₁ (keeping θ₀ = 0, so h(x) = θ₁x) and see how the cost function J(θ₁) behaves on the training set (1,1), (2,2), (3,3):

J(1)   = 1/(2·3) [(0)² + (0)² + (0)²] = 0
J(0.5) = 1/(2·3) [(0.5−1)² + (1−2)² + (1.5−3)²] ≈ 0.58
J(0)   = 1/(2·3) [(1)² + (2)² + (3)²] ≈ 2.3

Given our training data, θ₁ = 1 is the best value, since it minimizes J(θ₁). (The slide plots h(x) for each choice of θ₁ alongside the resulting curve of J(θ₁).) But this figure varies only θ₁; the problem becomes more complicated once we also vary θ₀, i.e., J(θ₀, θ₁).
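
A tiny Octave sketch (illustrative) that reproduces these numbers by sweeping θ₁ over the toy training set:

In Octave:
x = [1; 2; 3];  y = [1; 2; 3];      % toy training set from the slide
for theta1 = [0, 0.5, 1]
  J = sum((theta1 * x - y) .^ 2) / (2 * length(y));
  printf("J(%.1f) = %.2f\n", theta1, J);
end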

Cost Function Intuition II: Remember that our goal is to find the values of θ₀ and θ₁ that minimize the cost.

Hypothesis:     h(x) = θ₀ + θ₁x
Parameters:     θ₀, θ₁
Cost function:  J(θ₀, θ₁) = (1/2m) Σ_{i=1..m} (h(x^(i)) − y^(i))²
Goal:           minimize J(θ₀, θ₁)

Cost Function Intuition II: Given a dataset, when we plot the cost function J(θ₀, θ₁) over θ₀ and θ₁ we get a bowl-shaped 3D surface. Our goal is to find the values of θ₀ and θ₁ at which J(θ₀, θ₁) is minimal.

Cost Function Intuition II: We may also draw the cost function using contour figures, where each contour connects values of (θ₀, θ₁) with the same cost J.

Cost Function Intuition II: For example, the hypothesis h(x) = 800 + 0.15x corresponds to one point (θ₀, θ₁) on the contour plot of J.

Cost Function Intuition II: Similarly, the hypothesis h(x) = 360 + 0·x corresponds to another point (θ₀, θ₁) on the contour plot of J.

Cost Function Intuition II: Is there a way/algorithm to find the best θ's automatically? Yes, e.g., the gradient descent algorithm.

Part 4: The Gradient Descent Algorithm

The Gradient Descent Algorithm: Start with some initial values of θ₀ and θ₁; keep changing θ₀ and θ₁ to reduce J(θ₀, θ₁); until we hopefully end up at a minimum.

Repeat until convergence (for j = 0 and j = 1):
  temp0 := θ₀ − α · ∂/∂θ₀ J(θ₀, θ₁)
  temp1 := θ₁ − α · ∂/∂θ₁ J(θ₀, θ₁)
  θ₀ := temp0
  θ₁ := temp1

Here α is the learning rate and ∂/∂θⱼ J(θ₀, θ₁) is the partial derivative of the cost with respect to θⱼ; the temporary variables ensure that θ₀ and θ₁ are updated simultaneously.

The Problem of the Gradient Descent Algorithm: If α is too small, gradient descent can be slow. If α is too large, gradient descent can overshoot the minimum; it may fail to converge, or even diverge. Also, depending on where it starts, it may converge to the global minimum or only to a local minimum.

The Gradient Descent Algorithm for Linear Regression: Plugging the linear-regression cost function into the update rule (for j = 0 and j = 1) gives the simplified version:

Repeat until convergence:
  θ₀ := θ₀ − α · (1/m) Σ_{i=1..m} (h(x^(i)) − y^(i))
  θ₁ := θ₁ − α · (1/m) Σ_{i=1..m} (h(x^(i)) − y^(i)) · x^(i)

Next, we will use linear algebra to minimize J(θ) directly (the so-called Normal Equation), without needing an iterative algorithm like gradient descent. However, gradient descent scales better for bigger datasets.
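
A compact Octave sketch of these updates (an illustration, not from the slides), written in vectorized form with x₀ = 1 so both parameters are updated simultaneously; alpha and num_iters are values you would tune:

In Octave:
function theta = gradient_descent(X, y, alpha, num_iters)
  % X is an m-by-2 matrix [1, x]; y is an m-by-1 vector of targets
  m = length(y);
  theta = zeros(2, 1);
  for iter = 1:num_iters
    h = X * theta;                                % predictions h(x) for all examples
    theta = theta - alpha * (X' * (h - y)) / m;   % simultaneous update of theta0 and theta1
  end
end

For example, gradient_descent([ones(3,1), [1;2;3]], [1;2;3], 0.1, 1000) converges to approximately θ = (0, 1) on the toy training set used earlier, while a much larger α (say 2) overshoots and diverges, illustrating the previous slide.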

Part 5: The Normal Equation

Minimizing!" numerically Another way to numerically (using linear algebra) estimate the optimal values of!". Given the following features: 1 2 3 4 y Size ft 2 bedrooms floors Age Price 2104 5 1 45 460 1416 3 2 40 232 1534 3 2 30 315 852 2 1 36 178 Convert the features and the target value into matries: X = 2104 5 1 45 1416 3 2 40 1534 3 2 30 852 2 1 36 y= 460 232 315 178 28

Minimizing!" numerically 1 2 3 4 y Size ft 2 bedrooms floors Age Price 2104 5 1 45 460 1416 3 2 40 232 1534 3 2 30 315 852 2 1 36 178 add # 0, so to represent! 0 X = 1 2104 5 1 45 1 1416 3 2 40 1 1534 3 2 30 1 852 2 1 36 y= 460 232 315 178 29

The Normal Equation: To obtain the optimal (cost-minimizing) values of the θ's, use the following normal equation:

θ = (XᵀX)⁻¹ Xᵀ y

where:
θ: the vector of θ's we want to find
X: the matrix of features in the training set
y: the output values we try to predict
Xᵀ: the transpose of X
(XᵀX)⁻¹: the inverse of the matrix XᵀX

In Octave: pinv(X'*X)*X'*y

Gradient Descent vs. the Normal Equation (m training examples, n features):

Gradient Descent: needs to choose the learning rate α; needs many iterations; works well even when n is large.

Normal Equation: no need to choose α; no need to iterate; but it needs to compute (XᵀX)⁻¹, which takes about O(n³) time, so it is slow if n is very large; also, some matrices (e.g., singular ones) are non-invertible.

Recommendation: use the normal equation if the number of features is less than about 1000; otherwise use gradient descent.

Part 6: Linear Algebra Overview

Matrix-Vector Multiplication:

[1 3]   [1]   [16]
[4 0] × [5] = [ 4]
[2 1]         [ 7]

1·1 + 3·5 = 16
4·1 + 0·5 = 4
2·1 + 1·5 = 7
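
The same computation in Octave (a quick check, not part of the slides):

In Octave:
A = [1 3; 4 0; 2 1];
v = [1; 5];
A * v        % matrix-vector product: [16; 4; 7]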

Matrix-Matrix Multiplication:

[1 3 2]   [1 3]   [11 10]
[4 0 1] × [0 1] = [ 9 14]
          [5 2]

Multiply with the first column:  [1 3 2; 4 0 1] × [1; 0; 5] = [11; 9]
Multiply with the second column: [1 3 2; 4 0 1] × [3; 1; 2] = [10; 14]

Matrix Inverse: If A is an m×m matrix and it has an inverse, then A·A⁻¹ = A⁻¹·A = I; multiplying a matrix by its inverse produces an identity matrix:

[3  4]   [ 0.4   -0.1  ]   [1 0]
[2 16] × [-0.05   0.075] = [0 1]

Some matrices do not have an inverse, e.g., a matrix whose cells are all zeros. For more, please review linear algebra.

Matrix Transpose: Example:

A = [1 2 0]    Aᵀ = [1 3]
    [3 5 9]         [2 5]
                    [0 9]

Let A be an m×n matrix, and let B = Aᵀ. Then B is an n×m matrix, and B_ij = A_ji.
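
These operations can be checked in Octave (a quick illustration using the matrices from the slides above):

In Octave:
A = [1 3 2; 4 0 1];
B = [1 3; 0 1; 5 2];
A * B            % matrix-matrix product: [11 10; 9 14]
C = [3 4; 2 16];
C * inv(C)       % identity matrix: [1 0; 0 1]
D = [1 2 0; 3 5 9];
D'               % transpose: [1 3; 2 5; 0 9]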

Part 7: Using Octave

Octave: Download from https://www.gnu.org/software/octave/ A high-level scientific programming language; a free alternative to (and compatible with) Matlab; it helps in solving linear and nonlinear problems numerically. Online tutorial: https://www.youtube.com/playlist?list=plnnr1o8owc6aajsc50lzzpvwgjzuczssd

Compute the Normal Equation Using Octave: Given

X = [1 2104 5 1 45      y = [460
     1 1416 3 2 40           232
     1 1534 3 2 30           315
     1  852 2 1 36]          178]

In Octave:
load X.txt
load y.txt
theta = pinv(X'*X)*X'*y
save theta.txt theta

θ = [188.4, 0.4, -56, -93, -3.7]

These are the values of the θ's to use in our hypothesis function h(x):
h(x) = θ₀ + θ₁x₁ + θ₂x₂ + θ₃x₃ + θ₄x₄
h(x) = 188.4 + 0.4x₁ − 56x₂ − 93x₃ − 3.7x₄
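
As a usage illustration (not on the original slide), once θ is computed we can estimate the price of a new house by plugging its features into h(x); the house below is a hypothetical example:

In Octave:
theta = [188.4; 0.4; -56; -93; -3.7];  % values computed above
x_new = [1; 1500; 3; 2; 20];           % hypothetical house: 1500 ft^2, 3 bedrooms, 2 floors, 20 years old
price = theta' * x_new                 % estimated price in $1000s, about 360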

Part 8: Using R

R (programming language): A free software environment for statistical computing and graphics, supported by the R Foundation for Statistical Computing. R comes with many functions that can do sophisticated things, and the ability to install additional packages to do much more. Download R: https://cran.r-project.org/ A very good IDE for R is RStudio: https://www.rstudio.com/products/rstudio/download/ R basics tutorial: https://www.youtube.com/playlist?list=pljgj6kdf_snybkiswqycytUZiDpam7ygg

Compute the Normal Equation Using R: Given the same X and y matrices as above, in R:

X = as.matrix(read.table("~/X.txt", header=F, sep=","))
Y = as.matrix(read.table("~/y.txt", header=F, sep=","))
thetas = solve( t(X) %*% X ) %*% t(X) %*% Y
write.table(thetas, file="~/thetas.txt", row.names=F, col.names=F)

t(X) is the transpose of matrix X; solve(X) is the inverse of matrix X.

References
[1] Andrew Ng's course about Machine Learning: https://www.youtube.com/channel/ucmoxogx9mgrynewpciqucag
[2] Sami Ghawi, Mustafa Jarrar: Lecture Notes on Introduction to Machine Learning, Birzeit University, 2018
[3] Mustafa Jarrar: Lecture Notes on Decision Tree Machine Learning, Birzeit University, 2018
[4] Mustafa Jarrar: Lecture Notes on Linear Regression Machine Learning, Birzeit University, 2018