Lecture 11 Linear regression

Size: px

Start display at page:

Download "Lecture 11 Linear regression"

Elfreda Owens
5 years ago
Views:

1 Advanced Algorithms Floriano Zini Free University of Bozen-Bolzano Faculty of Computer Science Academic Year Lecture 11 Linear regression These slides are taken from Andrew Ng, Machine Learning on Coursera - 1

2 Machine Learning definition Arthur Samuel (1959) Machine Learning: Field of study that gives computers the ability to learn without being explicitly programmed The Samuel Checkers-playing Program appears to be the world's first self-learning program Machine Learning definition Tom Mitchell (1998) Well-posed Learning Problem: A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E 2

3 A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E. Suppose your program watches which s you do or do not mark as spam, and based on that learns how to better filter spam. What is the task T in this setting? q Classifying s as spam or not spam q Watching you label s as spam or not spam. q The number (or fraction) of s correctly classified as spam/not spam. q None of the above this is not a machine learning problem A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E. Suppose your program watches which s you do or do not mark as spam, and based on that learns how to better filter spam. What is the task T in this setting? T E P q Classifying s as spam or not spam q Watching you label s as spam or not spam. q The number (or fraction) of s correctly classified as spam/not spam. q None of the above this is not a machine learning problem 3

Examples of ML applications (1) Email spam filtering T: to predict

predicted E: a set of emails idenafied as spam/non spam for the

predefined categories P: % of Web pages correctly categorized E: a

? no yes Examples of ML applications (2) HandwriAng recogniaon T:

words correctly classified E: a database of handwriken words with

public highways using vision sensors P: average distance traveled

images and steering commands recorded while observing a human

4 Examples of ML applications (1) spam filtering T: to predict which are spam for a given user P: % of s correctly predicted E: a set of s idenafied as spam/non spam for the user Web pages categorizaaon T: to categorize Web pages in predefined categories P: % of Web pages correctly categorized E: a set of Web pages with specified categories spam? Which cat.? no yes Examples of ML applications (2) HandwriAng recogniaon T: to recognize and classify handwriken words within images P: % of words correctly classified E: a database of handwriken words with given classificaaons (i.e., labels) Robot driving T: to drive on public highways using vision sensors P: average distance traveled before an error (as judged by human observer) E: a sequence of images and steering commands recorded while observing a human driver Which word? Which steering command? we do in the right way Go straight Move left Move right Slow down Speed up 4

Examples of ML applications (3) Medical diagnosis T: suggest treatments to doctors P: % of suggesaons accepted by the

recommend people movies they like P: overall saasfacaon of the users E: movie raangs given by the users of the system

g., clustering) In this lecture 1 algorithm for supervised learning Linear regression These topics are expanded in the

5 Examples of ML applications (3) Medical diagnosis T: suggest treatments to doctors P: % of suggesaons accepted by the doctors E: a database of electronic health records including <observaaons,treatment> pairs Movie recommendaaon T: to recommend people movies they like P: overall saasfacaon of the users E: movie raangs given by the users of the system Which movie? Which treatment? t1 t2 t3 t4 t5 t6 Machine learning algorithms Supervised learning Unsupervised learning (e.g., clustering) In this lecture 1 algorithm for supervised learning Linear regression These topics are expanded in the courses: Data Mining by Dr. Mouna Kacimi (second semester) Information Search and Retrieval by Prof. Francesco Ricci (second semester) 5

6 Supervised learning Housing price prediction Price ($) in 1000 s Size in feet 2 Supervised Learning right answers given Regression: Predict continuous valued output (price) Supervised learning Breast cancer (malignant, benign) 1(Y) Classification Discrete valued output (0 or 1) Malignant? 0(N) Tumor size Tumor s ize In this learning problem we consider just one feature, i.e., the tumor size 6

7 Supervised learning Age Tumor size In this learning problem we consider two features, i.e., the tumor size, and the age In general, in real cases there are much more features. E.g., Clump Thickness Uniformity of Cell Size Uniformity of Cell Shape You re running a company, and you want to develop learning algorithms to address each of two problems Problem 1: You have a large inventory of identical items. You want to predict how many of these items will sell over the next 3 months Problem 2: You d like software to examine individual customer accounts, and for each account decide if it has been hacked/ compromised Should you treat these as classification or as regression problems? q Treat both as classification problems q Treat problem 1 as a classification problem, problem 2 as a regression problem q Treat problem 1 as a regression problem, problem 2 as a classification problem q Treat both as regression problems 7

8 You re running a company, and you want to develop learning algorithms to address each of two problems Problem 1: You have a large inventory of identical items. You want to predict how many of these items will sell over the next 3 months Problem 2: You d like software to examine individual customer accounts, and for each account decide if it has been hacked/ compromised Should you treat these as classification or as regression problems? q Treat both as classification problems q Treat problem 1 as a classification problem, problem 2 as a regression problem q Treat problem 1 as a regression problem, problem 2 as a classification problem 0 - not hacked q Treat both as regression problems 1 - hacked Linear regression with one variable 8

9 Model representation Housing Prices (Portland, OR) Price (in 1000s of dollars) Size (feet 2 ) Supervised Learning Given the right answer for each example in the data Regression Problem Predict real-valued output Model representation Training set of housing prices (Portland, OR) Price ($) in 1000's (y) Size in feet 2 (x) m=47 Notation: m = Number of training examples x s = input variable / features y s = output variable / target variable (x, y) = one training example (x (i), y (i) )= i th training example E.g., x (1) =2104 x (2) =1416 y (1) =460 (x (3), y (3) ) = (1534, 315) 9

10 Model representation How do we represent h? Training Set shorthand: h(x) Size of house x Learning Algorithm h hypothesis h maps from x s to y s Estimated price (estimated value of y) y x Linear regression with one variable. Univariate linear regression. Cost function Training set Size in feet 2 (x) Price ($) in 1000's (y) m=47 Hypothesis: s: Parameters How to choose s? 10

11 Cost function h(x) = 1+0.5*x 2 h(x) = *x 2 h(x) = 0.5*x

12 Cost function y Idea: Choose so that is close to for our training examples x Cost function Goal: is also called squared error function 12

13 Cost function intuition 1 Hypothesis: Simplified Parameters: h(x) h(x) Cost Function: Goal: Cost function intuition 1 (for fixed, this is a function of x) (function of the parameter ) 3 2 y x = J(1)= 0 13

14 Cost function intuition 1 (for fixed, this is a function of x) (function of the parameter ) y x J(0.5)= 0.58 = 1/(2*3) * [(0.5-1) 2 +(1-2) 2 +(1.5-3) 2 ] 0.58 Cost function intuition 1 (for fixed, this is a function of x) (function of the parameter ) 3 y x 2 3 = 1/(2*3) * [ ] J(0)=

15 3 3 15

16 Cost function intuition 2 Hypothesis: Parameters: Cost Function: Goal: Cost function intuition 2 (for fixed, this is a function of x) (function of the parameters ) 500 Price ($) in 1000 s Size in feet 2 (x)

17 Cost function intuition 2 (for fixed, this is a function of x) (function of the parameters ) Contour plot h(x) = *x = K Cost function intuition 2 (for fixed, this is a function of x) (function of the parameters ) h(x) = * x

Cost function intuition 2 (for fixed, this is a function of x) (function of the parameters ) h(x) Gradient descent Have some function Want More in general:

18 Cost function intuition 2 (for fixed, this is a function of x) (function of the parameters ) h(x) Gradient descent Have some function Want More in general: J(θ 0, θ 1,., θ n ) min θ J(θ 0, θ 1,., θ n ) Outline: Start with some (e.g., θ 0 = 0, θ 1 = 0) Keep changing to reduce until we hopefully end up at a minimum 18

19 Gradient descent J(θ 0,θ 1 ) θ 1 θ 0 Gradient descent J(θ 0,θ 1 ) θ 0 θ 1 19

20 Gradient descent algorithm Learning rate Correct: Simultaneous update Incorrect: 20

21 Gradient descent intuition Gradient descent algorithm Learning rate Derivative Simplification min J(θ 1 ),θ 1 R θ 1 21

22 Gradient descent intuition J(θ 1 ),θ 1 R θ 1 :=θ 1 α d dθ 1 J(θ 1 ) > 0 θ 1 :=θ 1 α (positive number) θ 1 d dθ 1 J(θ 1 ) 0 θ 1 :=θ 1 α (negative number) θ 1 Gradient descent intuition J(θ 1 ) If α is too small, gradient descent can be slow If α is too large, gradient descent can overshoot the minimum. It may fail to converge, or even diverge. J(θ 1 ) 22

23 What if is already at the local minimum? at local optima = 0 Current value of θ 1 :=θ 1 α * 0 unchanged Gradient descent intuition Gradient descent can converge to a local minimum, even with the learning rate α fixed As we approach a local minimum, gradient descent will automatically take smaller steps So, no need to decrease α over time 23

Gradient descent for linear regression 1 θ j 2m = 1 θ j 2m m 2 ( h θ (x (i) ) y (i) ) = i=1 m (

24 Gradient descent for linear regression Gradient descent algorithm Linear Regression Model Key term We apply gradient descent to minimize J(θ 0,θ 1 ) : min J(θ 0,θ 1 ) θ 0,θ 1 R θ 0,θ 1 Gradient descent for linear regression 1 θ j 2m = 1 θ j 2m m 2 ( h θ (x (i) ) y (i) ) = i=1 m ( θ 0 +θ 1 x (i) y (i)) i=1 2 m 1 h θ (x (i) ) y (i) m i=1 ( ) m 1 ( h θ (x (i) ) y (i) ) x (i) m i=1 24

Gradient descent for linear regression Gradient descent algorithm update and simultaneously Gradient descent for linear regression In general, the

25 Gradient descent for linear regression Gradient descent algorithm update and simultaneously Gradient descent for linear regression In general, the local optimum of the cost function J found by the gradient descent algorithm depends on the initialization of the parameters J(θ 0,θ 1 ) θ 0 θ 1 25

26 Gradient descent for linear regression The cost function J for liner regression is always a convex (or bowl-shaped ) function Only 1 global optimum Gradient descent for linear regression (for fixed, this is a function of x) (function of the parameters ) h(x) = *x 26

27 Gradient descent for linear regression (for fixed, this is a function of x) (function of the parameters ) Gradient descent for linear regression (for fixed, this is a function of x) (function of the parameters ) 27

28 Gradient descent for linear regression (for fixed, this is a function of x) (function of the parameters ) Gradient descent for linear regression (for fixed, this is a function of x) (function of the parameters ) 28

29 Gradient descent for linear regression (for fixed, this is a function of x) (function of the parameters ) Gradient descent for linear regression (for fixed, this is a function of x) (function of the parameters ) 29

30 Gradient descent for linear regression (for fixed, this is a function of x) (function of the parameters ) Gradient descent for linear regression (for fixed, this is a function of x) (function of the parameters ) 30

31 Gradient descent for linear regression Batch Gradient Descent Each step of gradient descent uses all training examples in calculating the derivatives There are variants: Stochastic: uses 1 example in each iteration Faster algorithm, but may not reach the global minimum, even tough it gets close to it Mini-batch: uses b examples in each iteration, (1<b<m) 31

32 Linear regression with multiple variables 32

33 Multiple features Size (feet 2 ) Price ($1000) Single feature (variable) Multiple features (variables) Size (feet 2 ) Number of bedrooms Number of floors Age of home (years) Price ($1000) Multiple features Multiple features (variables) Size (feet 2 ) Number of Number of Age of home Price ($1000) bedrooms floors (years) x 1 x 2 x 3 x 4 y m=47 n=4 Notation: n = number of features x (i) = input (features) of i th training example x (i) j = value of feature j in ith training example! # x (2) = # # # " $ & & & & % x 3 (2) = 2 33

Multiple features Hypothesis: Previously: Now: h θ (x) =θ 0 +θ 1 x 1 +θ 2 x 2 +θ 3 x 3 +θ 4 x 4 E.g.: h θ (x) = 80 + 0.1* x 1 + 0.

34 Multiple features Hypothesis: Previously: Now: h θ (x) =θ 0 +θ 1 x 1 +θ 2 x 2 +θ 3 x 3 +θ 4 x 4 E.g.: h θ (x) = * x * x 2 + 3* x 3 2 * x 4 size # bedrooms # floors age Multiple features For convenience of notation, define x 0 (i) =1! # # # x = # # # "# x 0 x 1 x 2 x n $! & # & # & & R n+1 # θ = # & # & # %& "# θ 0 θ 1 θ 2 θ n $ & & & & R n+1 & & %& =1 h θ (x) =θ 0 x 0 +θ 1 x 1 + +θ n x n =θ T x Multivariate linear regression 34

dimensional vector Cost function: Gradient

Repeat (simultaneously update ) for E.g.

35 Gradient descent for multiple variables Hypothesis: =1 Parameters: θ R n+1 n+1 dimensional vector Cost function: Gradient descent: Repeat J(θ) J(θ) (simultaneously update for every ) Gradient descent for multiple variables Gradient Descent Previously (n=1): Repeat New algorithm : Repeat (simultaneously update ) for E.g., n=2 (simultaneously update ) 35

Mustafa Jarrar. Machine Learning Linear Regression. Birzeit University. Mustafa Jarrar: Lecture Notes on Linear Regression Birzeit University, 2018

Mustafa Jarrar. Machine Learning Linear Regression. Birzeit University. Mustafa Jarrar: Lecture Notes on Linear Regression Birzeit University, 2018 Mustafa Jarrar: Lecture Notes on Linear Regression Birzeit University, 2018 Version 1 Machine Learning Linear Regression Mustafa Jarrar Birzeit University 1 Watch this lecture and download the slides Course