Overview. Linear Regression with One Variable Gradient Descent Linear Regression with Multiple Variables Gradient Descent with Multiple Variables


Garry Lloyd
5 years ago

1 Overview
Linear Regression with One Variable
Gradient Descent
Linear Regression with Multiple Variables
Gradient Descent with Multiple Variables

2 Example: Advertising Data
Data taken from An Introduction to Statistical Learning with Applications in R ( gareth/isl/datahtml).
The data consists of the advertising budgets for three media (TV, radio and newspapers) and the overall sales in 200 different markets.
Columns: TV, Radio, Newspaper, Sales.

3 Example: Advertising Data
[Figure: scatter plot of Sales against TV Advertising]
Suppose we want to predict sales in a new area, or predict sales when the TV advertising budget is increased to 35.
One approach: draw a line that fits through the data points.

4 Some Notation
Training data: TV ($x$), Sales ($y$)
$m$ = number of training examples
$x$ = input variable/feature
$y$ = output variable/target variable
$(x^{(i)}, y^{(i)})$ = the $i$th training example, e.g. $x^{(1)} = 230.1$, $y^{(1)} = 22.1$, $x^{(2)} = 44.5$, $y^{(2)} = 10.4$

5 Model Representation
Training Data → Learning Algorithm → $h_\theta$
TV budget $x$ → $h_\theta$ → predicted sales $\hat{y}$
Prediction: $\hat{y} = h_\theta(x) = \theta_0 + \theta_1 x$
$\theta_0$, $\theta_1$ are (unknown) parameters.
We sometimes abbreviate $h_\theta(x)$ to $h(x)$.
[Figure: three plots of Sales against TV Advertising, each showing the line $h_\theta(x)$ for a different choice of $\theta_0$, $\theta_1$]

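6 Cost Function: How to choose model parameters $\theta$?
Prediction: $\hat{y} = h_\theta(x) = \theta_0 + \theta_1 x$
Idea: choose $\theta_0$ and $\theta_1$ so that $h_\theta(x^{(i)})$ is close to $y^{(i)}$ for each of our training examples $(x^{(i)}, y^{(i)})$, $i = 1, \ldots, m$.
Least squares case: select the values for $\theta_0$ and $\theta_1$ that minimise the cost function:
$$J(\theta_0, \theta_1) = \frac{1}{m} \sum_{i=1}^{m} \left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$$
[Figure: two plots of $y$ against $x$]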
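The least squares cost above can be sketched in a few lines of plain Python. This is a minimal illustration rather than part of the original slides: the function names `predict` and `cost` and the toy data are assumptions chosen so the result is easy to check by hand.

```python
def predict(theta0, theta1, x):
    """Hypothesis h_theta(x) = theta0 + theta1 * x."""
    return theta0 + theta1 * x


def cost(theta0, theta1, xs, ys):
    """Mean squared error J(theta0, theta1) = (1/m) * sum((h(x_i) - y_i)^2)."""
    m = len(xs)
    return sum((predict(theta0, theta1, x) - y) ** 2 for x, y in zip(xs, ys)) / m


# Made-up toy data lying exactly on the line y = 1 + 2x (not the advertising data).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

perfect = cost(1.0, 2.0, xs, ys)  # the true line: every error is zero
worse = cost(0.0, 2.0, xs, ys)    # intercept off by 1: every error is -1

print(perfect, worse)
```

Parameters that match the data give a cost of zero, and any other choice gives a strictly larger value, which is what makes $J$ usable as a selection criterion.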

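7 Simple Example
Suppose our training data consists of just two observations, (1, 3) and (2, 1), listed here as $(y, x)$ pairs so that $y^{(1)} = 1$, $x^{(1)} = 3$, $y^{(2)} = 2$, $x^{(2)} = 1$, and to keep things simple we know that $\theta_0 = 0$.
The cost function is
$$\frac{1}{2} \sum_{j=1}^{2} \left(y^{(j)} - \theta_1 x^{(j)}\right)^2 = \frac{1}{2}\left[(1 - 3\theta_1)^2 + (2 - 1\theta_1)^2\right]$$
What value of $\theta_1$ minimises $(1 - 3\theta_1)^2 + (2 - 1\theta_1)^2$?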
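The question on this slide can be checked numerically. The sketch below assumes the expansion $(1 - 3\theta_1)^2 + (2 - 1\theta_1)^2$ is read as $y = 1, x = 3$ and $y = 2, x = 1$ with $\theta_0 = 0$. It uses the standard closed-form minimiser for a line through the origin, $\theta_1 = \sum x y / \sum x^2$ (obtained by setting the derivative of the cost to zero), which is a swapped-in technique and not something stated on the slide.

```python
# Two observations, read from the expanded cost on the slide (an assumption).
xs = [3.0, 1.0]
ys = [1.0, 2.0]


def cost(theta1):
    """J(theta1) = (1/2) * sum((y - theta1 * x)^2), with theta0 fixed at 0."""
    return sum((y - theta1 * x) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Setting dJ/dtheta1 = 0 gives theta1 = sum(x*y) / sum(x*x).
theta1_star = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

print(theta1_star, cost(theta1_star))
# Nearby values give a larger cost, consistent with a minimum.
print(cost(theta1_star - 0.1), cost(theta1_star + 0.1))
```

By hand: the derivative of $(1 - 3\theta_1)^2 + (2 - \theta_1)^2$ is $-6(1 - 3\theta_1) - 2(2 - \theta_1) = 20\theta_1 - 10$, which is zero at $\theta_1 = 0.5$.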

8 Example: Advertising Data
[Figure: plot of the cost function $J(\theta_0, \theta_1)$ for the advertising data]

9 Example: Advertising Data
Least squares linear fit.
Residuals are the difference between the value predicted by the fit and the measured value.
Do the residuals look random or do they have some structure? Is our model satisfactory?
We can use the residuals to estimate a confidence interval for the prediction made by our linear fit.
We could use cross-validation/bootstrapping to estimate our confidence in the fit itself.
[Figure: left panel, Sales against TV advertising with the data and the linear fit; right panel, Residual against TV advertising]
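As a rough illustration of residuals, the sketch below fits a line to a small made-up data set and computes predicted minus measured values. It uses the standard closed-form least squares solution for a straight line rather than gradient descent, purely to keep the example short; the function name `least_squares_line` and the data are assumptions.

```python
def least_squares_line(xs, ys):
    """Closed-form least squares fit of y = theta0 + theta1 * x."""
    m = len(xs)
    x_mean = sum(xs) / m
    y_mean = sum(ys) / m
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    theta1 = sxy / sxx
    theta0 = y_mean - theta1 * x_mean
    return theta0, theta1


# Made-up noisy data (not the advertising data).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.2, 1.9, 3.2, 3.8, 5.1]

theta0, theta1 = least_squares_line(xs, ys)

# Residual = value predicted by the fit minus the measured value.
residuals = [(theta0 + theta1 * x) - y for x, y in zip(xs, ys)]

print(theta0, theta1)
print(residuals)
```

For a least squares fit that includes an intercept, the residuals sum to zero, so any remaining pattern in them (for example a curve or a widening spread) is what suggests the linear model is missing structure.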

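10 Summary
Hypothesis: $h_\theta(x) = \theta_0 + \theta_1 x$
Parameters: $\theta_0$, $\theta_1$
Cost Function: $J(\theta_0, \theta_1) = \frac{1}{m} \sum_{i=1}^{m} \left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$
Goal: select $\theta_0$ and $\theta_1$ that minimise $J(\theta_0, \theta_1)$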

11 Gradient Descent
We need to select $\theta_0$ and $\theta_1$ that minimise $J(\theta_0, \theta_1)$.
Brute force search over pairs of values of $\theta_0$ and $\theta_1$ is inefficient. Can we be smarter?
Start with some $\theta_0$ and $\theta_1$.
Repeat: update $\theta_0$ and $\theta_1$ to new values which make $J(\theta_0, \theta_1)$ smaller.
When the curve is bowl shaped, or convex, this must eventually find the minimum.
[Figure: plot of $J(\theta_0, \theta_1)$]

12 Gradient Descent
Start with some $\theta_0$ and $\theta_1$.
Repeat: update $\theta_0$ and $\theta_1$ to new values which make $J(\theta_0, \theta_1)$ smaller.
When the curve has several minima, we can't be sure which one we will converge to.
We might converge to a local minimum, not the global minimum.

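13 Gradient Descent
Repeat: update $\theta_0$ and $\theta_1$ to new values which make $J(\theta_0, \theta_1)$ smaller.
One option: carry out a local search over $\theta_0$ and $\theta_1$ to find values that decrease $J$.
Another option, gradient descent:
  temp0 := $\theta_0 - \alpha \frac{\partial}{\partial \theta_0} J(\theta_0, \theta_1)$
  temp1 := $\theta_1 - \alpha \frac{\partial}{\partial \theta_1} J(\theta_0, \theta_1)$
  $\theta_0$ := temp0, $\theta_1$ := temp1
Why this decreases $J$: for $\delta$ sufficiently small,
$$\frac{\partial}{\partial \theta_0} J(\theta_0, \theta_1) \approx \frac{J(\theta_0 + \delta, \theta_1) - J(\theta_0, \theta_1)}{\delta}$$
so
$$J(\theta_0 + \delta, \theta_1) \approx J(\theta_0, \theta_1) + \delta \frac{\partial}{\partial \theta_0} J(\theta_0, \theta_1)$$
When $\delta = -\alpha \frac{\partial}{\partial \theta_0} J(\theta_0, \theta_1)$ then
$$J(\theta_0 + \delta, \theta_1) \approx J(\theta_0, \theta_1) - \alpha \left(\frac{\partial}{\partial \theta_0} J(\theta_0, \theta_1)\right)^2$$
which is no larger than $J(\theta_0, \theta_1)$ when $\alpha > 0$.
[Figure: plot of $J(\theta_0, \theta_1)$ against $\theta_0$]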
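The first-order argument above can be checked numerically. The sketch below is an illustration under assumptions: it uses made-up toy data, the mean squared error cost from the earlier slides, and the hypothetical helper names `cost` and `dcost_dtheta0`. It takes one step in $\theta_0$ only and compares the actual new cost with the predicted value $J - \alpha (\partial J / \partial \theta_0)^2$.

```python
# Made-up toy data lying on y = 1 + 2x (not the advertising data).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
m = len(xs)


def cost(theta0, theta1):
    """J(theta0, theta1) = (1/m) * sum((theta0 + theta1*x - y)^2)."""
    return sum((theta0 + theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / m


def dcost_dtheta0(theta0, theta1):
    """Partial derivative of J with respect to theta0."""
    return (2.0 / m) * sum(theta0 + theta1 * x - y for x, y in zip(xs, ys))


theta0, theta1 = 0.0, 0.0
alpha = 0.01

gradient = dcost_dtheta0(theta0, theta1)
delta = -alpha * gradient  # the gradient descent step for theta0

before = cost(theta0, theta1)
actual = cost(theta0 + delta, theta1)
predicted = before - alpha * gradient ** 2  # first-order approximation

print(before, actual, predicted)
```

The actual and predicted values agree up to a term of order $\delta^2$, which is why the approximation only holds for small steps.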

14 Gradient Descent
[Figure: two plots of $J(\theta_0, \theta_1)$ against $\theta_0$, illustrating different step sizes]
Selecting the step size $\alpha$ too small means it takes a long time to converge to the minimum.
But selecting $\alpha$ too large can lead to us overshooting the minimum.
We need to adjust $\alpha$ so that the algorithm converges in a reasonable time.
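The effect of the step size can be seen on a deliberately simple stand-in cost. The sketch below assumes a hypothetical one-parameter cost $J(\theta) = \theta^2$ (gradient $2\theta$, minimum at $\theta = 0$), which is not the cost from the slides, and runs the same number of updates with three values of $\alpha$.

```python
def gradient_descent_1d(alpha, theta=1.0, steps=50):
    """Run gradient descent on the toy cost J(theta) = theta^2."""
    for _ in range(steps):
        gradient = 2.0 * theta           # dJ/dtheta for J = theta^2
        theta = theta - alpha * gradient
    return theta


too_small = gradient_descent_1d(alpha=0.001)  # barely moves from the start
reasonable = gradient_descent_1d(alpha=0.1)   # ends close to the minimum at 0
too_large = gradient_descent_1d(alpha=1.1)    # overshoots further on every step

print(too_small, reasonable, too_large)
```

For this toy cost each update multiplies $\theta$ by $(1 - 2\alpha)$, so the iterates shrink towards zero only when $|1 - 2\alpha| < 1$ and grow without bound otherwise.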

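15 Gradient Descent
For $J(\theta_0, \theta_1) = \frac{1}{m} \sum_{i=1}^{m} \left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$ with $h_\theta(x) = \theta_0 + \theta_1 x$:
$$\frac{\partial}{\partial \theta_0} J(\theta_0, \theta_1) = \frac{2}{m} \sum_{i=1}^{m} \left(h_\theta(x^{(i)}) - y^{(i)}\right)$$
$$\frac{\partial}{\partial \theta_1} J(\theta_0, \theta_1) = \frac{2}{m} \sum_{i=1}^{m} \left(h_\theta(x^{(i)}) - y^{(i)}\right) x^{(i)}$$
So the gradient descent algorithm is:
repeat:
  temp0 := $\theta_0 - \frac{2\alpha}{m} \sum_{i=1}^{m} \left(h_\theta(x^{(i)}) - y^{(i)}\right)$
  temp1 := $\theta_1 - \frac{2\alpha}{m} \sum_{i=1}^{m} \left(h_\theta(x^{(i)}) - y^{(i)}\right) x^{(i)}$
  $\theta_0$ := temp0, $\theta_1$ := temp1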
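The update rule above translates almost line for line into code. The following is a minimal sketch, assuming a fixed number of iterations, a starting point of zero and made-up toy data; the function name `gradient_descent` and the values of `alpha` and `iterations` are illustrative choices, not values from the slides.

```python
def gradient_descent(xs, ys, alpha, iterations):
    """Gradient descent for h(x) = theta0 + theta1 * x with mean squared error."""
    m = len(xs)
    theta0, theta1 = 0.0, 0.0
    for _ in range(iterations):
        # Errors h(x_i) - y_i, all computed with the current parameters.
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        temp0 = theta0 - (2.0 * alpha / m) * sum(errors)
        temp1 = theta1 - (2.0 * alpha / m) * sum(e * x for e, x in zip(errors, xs))
        # Simultaneous update: both temps use the old theta0 and theta1.
        theta0, theta1 = temp0, temp1
    return theta0, theta1


# Made-up toy data lying exactly on y = 1 + 2x (not the advertising data).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

theta0, theta1 = gradient_descent(xs, ys, alpha=0.05, iterations=2000)
print(theta0, theta1)
```

Using temporaries matters: updating $\theta_0$ first and then computing the $\theta_1$ gradient with the new $\theta_0$ would be a different algorithm from the one on the slide.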

16 Linear Algebra Review
It's assumed you know basic linear algebra for this module.
There is lots of revision material online, e.g. (coursera linear algebra review).
Basic notation:
Vector $x$: a column of numbers, each an element of $x$.
Matrix $A$: a rectangular array of numbers, each an element of $A$.
Transpose $x^T$: the column vector $x$ written as a row.
Inner product $x^T y = \sum_{i=1}^{n} x_i y_i$ for two vectors with $n$ elements.
Product of a matrix and a vector $Ax$, product of two matrices $AB$.
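For readers who want to see the notation in action, here is a small sketch of these operations using plain Python lists (vectors as lists, matrices as lists of rows). The helper names `inner`, `transpose`, `matvec` and `matmul` and the numeric values are assumptions for illustration; in practice a numerical library would normally be used instead.

```python
def inner(x, y):
    """Inner product x^T y = sum of x_i * y_i."""
    return sum(xi * yi for xi, yi in zip(x, y))


def transpose(A):
    """Transpose of a matrix stored as a list of rows."""
    return [list(column) for column in zip(*A)]


def matvec(A, x):
    """Matrix-vector product Ax: one inner product per row of A."""
    return [inner(row, x) for row in A]


def matmul(A, B):
    """Matrix-matrix product AB: rows of A against columns of B."""
    return [[inner(row, column) for column in transpose(B)] for row in A]


A = [[1, 2], [3, 4]]
B = [[0, 1], [1, 0]]

print(inner([1, 2, 3], [4, 5, 6]))  # 1*4 + 2*5 + 3*6
print(transpose(A))
print(matvec(A, [5, 6]))
print(matmul(A, B))
```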

17 Linear Regression with Multiple Variables
Advertising example: TV ($x_1$), Radio ($x_2$), Newspaper ($x_3$), Sales ($y$)
$n$ = number of features (3 in this example)
$(x^{(i)}, y^{(i)})$ = the $i$th training example, e.g. $x^{(1)} = [230.1, 37.8, 69.2]^T$
$x^{(i)}_j$ is feature $j$ in the $i$th training example, e.g. $x^{(1)}_2 = 37.8$

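18 Linear Regression with Multiple Variables
Hypothesis: $h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3$
e.g. a particular choice of numerical values for $\theta_0, \ldots, \theta_3$ gives predicted Sales as a constant plus weighted contributions from TV ($x_1$), Radio ($x_2$) and Newspaper ($x_3$).
For convenience, define $x_0 = 1$, i.e. $x^{(1)}_0 = 1$, $x^{(2)}_0 = 1$, etc.
Feature vector $x = [x_0, x_1, x_2, \ldots, x_n]^T$
Parameter vector $\theta = [\theta_0, \theta_1, \theta_2, \ldots, \theta_n]^T$
$h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_n x_n = \theta^T x$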
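The $x_0 = 1$ convention is what lets the intercept be folded into a single inner product. A minimal sketch, with hypothetical helper names `add_intercept` and `h` and made-up parameter values (not fitted to the advertising data):

```python
def add_intercept(features):
    """Prepend x_0 = 1 so the intercept theta_0 is handled like any other weight."""
    return [1.0] + list(features)


def h(theta, x):
    """Hypothesis h_theta(x) = theta^T x, with x already including x_0 = 1."""
    return sum(t * xj for t, xj in zip(theta, x))


# Made-up parameters theta_0..theta_3 and one made-up example with n = 3 features.
theta = [1.0, 2.0, 3.0, 4.0]
x = add_intercept([1.0, 1.0, 1.0])

prediction = h(theta, x)
print(x, prediction)  # 1*1 + 2*1 + 3*1 + 4*1
```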

19 Linear Regression with Multiple Variables
Hypothesis: $h_\theta(x) = \theta^T x$ (with $\theta$, $x$ now $(n+1)$-dimensional vectors)
Cost Function: $J(\theta_0, \theta_1, \ldots, \theta_n) = J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$
Goal: select $\theta$ that minimises $J(\theta)$
As before, we can find $\theta$ using:
  Start with some $\theta$
  Repeat: update vector $\theta$ to a new value which makes $J(\theta)$ smaller
e.g. using gradient descent:
  Start with some $\theta$
  Repeat:
    for j = 0 to n { tempj := $\theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)$ }
    for j = 0 to n { $\theta_j$ := tempj }

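20 Gradient Descent with Multiple Variables
For $J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$ with $h_\theta(x) = \theta_0 + \theta_1 x_1 + \cdots + \theta_n x_n$:
$$\frac{\partial}{\partial \theta_0} J(\theta) = \frac{2}{m} \sum_{i=1}^{m} \left(h_\theta(x^{(i)}) - y^{(i)}\right)$$
$$\frac{\partial}{\partial \theta_1} J(\theta) = \frac{2}{m} \sum_{i=1}^{m} \left(h_\theta(x^{(i)}) - y^{(i)}\right) x^{(i)}_1$$
$$\frac{\partial}{\partial \theta_j} J(\theta) = \frac{2}{m} \sum_{i=1}^{m} \left(h_\theta(x^{(i)}) - y^{(i)}\right) x^{(i)}_j$$
So the gradient descent algorithm is:
  Start with some $\theta$
  Repeat:
    for j = 0 to n { tempj := $\theta_j - \frac{2\alpha}{m} \sum_{i=1}^{m} \left(h_\theta(x^{(i)}) - y^{(i)}\right) x^{(i)}_j$ }
    for j = 0 to n { $\theta_j$ := tempj }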
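The multi-variable algorithm can be sketched in the same style as the one-variable case. This is an illustration under assumptions: a fixed iteration count, a zero starting point, and made-up data with two features generated from $y = 1 + 2x_1 + 3x_2$; the function name `gradient_descent` and the values of `alpha` and `iterations` are not from the slides.

```python
def gradient_descent(X, ys, alpha, iterations):
    """Gradient descent for h(x) = theta^T x, where each row of X starts with x_0 = 1."""
    m = len(X)
    size = len(X[0])  # n + 1 parameters
    theta = [0.0] * size
    for _ in range(iterations):
        # Errors h(x_i) - y_i, all computed with the current theta.
        errors = [sum(t * xj for t, xj in zip(theta, row)) - y
                  for row, y in zip(X, ys)]
        # tempj := theta_j - (2*alpha/m) * sum_i(error_i * x_ij), for every j.
        temp = [theta[j] - (2.0 * alpha / m) * sum(e * row[j] for e, row in zip(errors, X))
                for j in range(size)]
        theta = temp  # simultaneous update of all theta_j
    return theta


# Made-up data generated from y = 1 + 2*x1 + 3*x2 (not the advertising data).
features = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (2.0, 1.0)]
X = [[1.0, x1, x2] for x1, x2 in features]
ys = [1.0 + 2.0 * x1 + 3.0 * x2 for x1, x2 in features]

theta = gradient_descent(X, ys, alpha=0.1, iterations=2000)
print(theta)
```

On real data such as the advertising budgets, features on very different scales would typically need a smaller $\alpha$ or rescaling before this converges in a reasonable number of iterations.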

21 Example: Advertising Data
How is the impact of the advertising spend on TV and radio related, if at all?
Perhaps a quadratic fit would be better? If so, what does that imply for how we allocate our advertising budget?
[Figure: plots of Sales against TV advertising and Radio advertising, and of Sales against both TV advertising and Radio advertising together]
